Series Foreword


Computational neuroscience is an approach to understanding the information
content of neural signals by modeling the nervous system at many
different  structural  scales,  including  the  biophysical,  the  circuit,  and  the 
systems  levels.  Computer  simulations  of  neurons  and  neural  networks  are 
complementary  to  traditional  techniques  in  neuroscience.  This  book  series 

approaches to understanding information processing in the nervous system.
Areas  and  topics  of  particular  interest  include  biophysical  mechanisms  for 
computation  in  neurons,  computer  simulations  of  neural  circuits,  models 
of  learning,  representations  of  sensory  information  in  neural  networks, 
systems  models  of  sensory-motor  integration,  and  computational  analysis 
of  problems  in  biological  sensing,  motor  control,  and  perception. 

Terrence  J.  Sejnowski 
Tomaso  A.  Poggio 

This book originated at a small and informal workshop held in December
of 1992 in Idyllwild, a relatively secluded resort village situated amid
forests in the San Jacinto Mountains above Palm Springs in Southern California.
Eighteen colleagues from a broad range of disciplines, including
biophysics, electrophysiology, neuroanatomy, psychophysics, clinical
studies, mathematics, and computer vision, discussed "Large Scale Models
of the Brain," that is, theories and models that cover a broad range
of phenomena, including early and late vision, various memory systems,
selective attention, and the neuronal code underlying figure-ground segregation
and awareness (for a brief summary of this meeting, see Stevens
1993). The bias in the selection of the speakers toward researchers in the area
of visual perception reflects both the academic background of one of the
organizers and the (relatively) more mature status of vision compared
with  other  modalities.  This  should  not  be  surprising  given  the  emphasis 
we  humans  place  on  "seeing"  for  orienting  ourselves,  as  well  as  the  intense 
scrutiny  visual  processes  have  received  due  to  their  obvious  usefulness  in 
military,  industrial,  and  robotic  applications. 

What  distinguishes  this  volume  from  the  myriad  of  edited  books  on 
brains,  neural  networks,  and  consciousness  that  currently  flood  the  market 
is the ambitious (some would say overly ambitious) attempt at constructing
theories from the bottom up, that is, firmly based on nerve cells, their
firing properties, and their anatomical connections. Such theorizing stands
in marked contrast to earlier attempts by psychologists, going back to Sigmund
Freud, and by theorists in the artificial intelligence community,
such as David Marr, to understand the brain from a primarily psychological
or computational point of view, that is, from the top down.

In the top-down approach, modules derived from psychological or mathematical
constraints are imposed onto the brain, without knowing or caring
to  what  extent  the  nervous  system  actually  implements  such  structures.  A 
case  in  point  is  the  distinction  between  the  various  forms  of  short-lasting 
memories  such  as  iconic,  working,  and  short-term  memory.  It  may  well 
be  that  each  part  of  the  brain  has  some  ability  to  change  its  response  as  a 
function  of  its  previous  history,  without  requiring  the  existence  of  a  small 
number  of  discrete  memory  modules  as  postulated  by  cognitive  science. 


Another misconception may be the division of the structure-from-motion
module into two separate ones, one for solving the correspondence problem
and one for deriving three-dimensional structure from the moving
two-dimensional  image. 

While cognitive, psychophysical, and computational considerations are
obviously crucial for understanding the brain (how else would we even
know about focal attention or the mathematical problems associated with
computing optical flow?), they are by themselves not powerful
enough to uniquely derive, for instance, the specific algorithms underlying
short-range motion perception. To achieve this, we need to know about
direction-selective cells in cortex, their distribution along the V1-MT-MST
pathway, and the representation of velocity at the single-cell level. Thus,
in the long run, memory, perception, and awareness can be solved only by
explanations at the neuronal level, explanations that can be tested using the
tools of electrophysiology and imaging, in combination with psychophysical
and theoretical studies. To the chagrin of many a theorist, however,
this emphasis on neuronally based models does rule out a number of seductive,
but from the point of view of the brain irrelevant, topics such as
Schrödinger's cat, quantum gravity, or whether or not the brain is a Turing
machine.

Est  ubi gloria  nunc  Babylonia?  After  all  has  been  said  and  done,  what  has 
remained  of  earlier  brain  theories?  In  the  years  since  the  end  of  the  Second 
World  War,  many  interdisciplinary  meetings  dedicated  to  the  experimental 
and  theoretical  study  of  the  brain  have  occurred.  Three  prominent  ones 
were  the  MIT  Endicott  House  Symposium  on  the  "Principles  of  Sensory 
Communications"  in  1959  (Rosenblith  1961),  the  Neurosciences  Research 
Program  work  session  on  "Theoretical  Approaches  in  Neurobiology"  in 
Boston in 1978 (Reichardt and Poggio 1981), and the recent Dahlem Workshop
on "Exploring Brain Functions: Models in Neuroscience" that took
place  in  a  Berlin  without  a  wall  (Poggio  and  Glaser  1993).  Yet  almost  all 
of  the  theories  and  models  proposed  and  discussed  in  at  least  the  first  two 
volumes  have  fallen  out  of  favor  and  have  ceased  to  be  part  of  the  current 
scientific  debate! 

In  fact,  with  the  exception  of  the  Hodgkin  and  Huxley  model  of  action 
potential  generation  and  propagation  (Hodgkin  and  Huxley  1952),  as  well 
as the correlation model of motion perception in beetles and flies (Hassenstein
and Reichardt 1956), no theory or model of brain function has survived
its birth by more than a decade (most of these models, in fact, die in
infancy!). Yet for all their ephemeral nature, models profoundly affect the
way we think about the brain. For instance, the idea of a Hebbian synapse
whose strength increases during a conjunction of pre- and postsynaptic
activity determines and shapes the LTP field. In visual perception, such
theoretical notions as multiple spatial scales, the correspondence problem,
epipolar lines, and the aperture problem attest to the legacy of theories of
computational vision. Given the pre-Copernican state of brain sciences,
this is the best we can hope for: that the models and theories presented
in this volume will shape and influence the way we think about the brain,
the mind, and the interactions between the two in the years to come.

We  wish  to  gratefully  acknowledge  the  Office  of  Naval  Research,  which 
has  the  vision  and  foresight  to  fund  such  interdisciplinary  meetings  as  our 
Idyllwild Workshop, and Candace Hochenedel for "sweating it out at the
keyboard" and converting all chapters into the appropriate dialect of LaTeX.
Danke schön.

Christof  Koch 

California  Institute  of  Technology 
Pasadena 

Joel  Davis 

Office  of  Naval  Research 
Arlington 
What Is the Computational Goal of the Neocortex?

Horace  Barlow 


INTRODUCTION 

The human species originated very recently and has been changing very
rapidly. Since the neocortex is the main structure that enlarged in primates
and now makes us (for our body size) the biggest brained of all animals, its
selective advantage is probably responsible for this extraordinarily rapid
evolution. Figure 1.1 attempts to give a perspective on all this by displaying
the history of our species on a cosmic time scale, and it shows
both that our status has been changing at a breathtaking rate over the past
10,000 years, and that there is now a serious threat of overpopulation of
the earth by humans. Does this mean that the neocortex has done its job
too well? And if it has, is there any alternative to further trust in its supposed
product, rational action planned by rational thought, to avert the
overpopulation threat? How the neocortex evolved so rapidly and what
it  does  are  important  problems. 

This chapter starts by emphasizing the inadequacy of the historical account
of the evolution of the human neocortex, and the insufficiency of the
neurophysiological account of it as providing a processed representation of
the current sensory input. Next a role for it is suggested that combines and
reconciles the neurophysiological view with that of comparative anatomists,
who  have  told  us  that  it  acquires  and  stores  knowledge  of  the  world.  At 
first  these  views  appear  to  be  quite  different,  but  the  hypothesis  that  the 
neocortical  representation  is  specialized  to  facilitate  the  identification  and 
learning  of  new  associations  amalgamates  them.  The  middle  part  of  the 
chapter  sets  out  the  requirements  for  such  a  specialized  representation,  and 
it  is  shown  that  a  working  model  or  cognitive  map  of  the  world  is  entailed 
in its production. This map or model would be used automatically in representing
sensory information, but the knowledge that the code embodies
might  also  be  accessible  by  a  different  route  for  imagery  and  recall.  I  think 
the  hypothesis  provides  a  new  and  illuminating  way  of  looking  at  the  key 
role  of  perception  in  mediating  between  sensation  and  learning.  The  last 
part  of  the  chapter  outlines  collaborative  work,  still  incomplete,  prompted 
by the hypothesis and done with A. R. Gardner-Medwin and D. J. Tolhurst,
respectively. The two questions were (1) How easy is it to identify
the association of reward or punishment with the logical conjunction of
two or more active representational elements? This is the "Yellow Volkswagen"
problem posed by Harris (1980); Gardner-Medwin has shown that
this can be done with reasonable efficiency in the case of frequently occurring
conjunctions in sparse representations, but rare conjunctions in dense
distributed representations will be masked by noise resulting from accidental
associations with the separate constituents of the conjunction. (2)
What features should be directly represented by single elements in order
to promote the efficient identification of associations? It is argued that one
should choose as primitives conjunctions of active elements that actually
occur often, but would be expected to occur only infrequently by chance.
Tolhurst has done measurements on natural images confirming that edges,
which the brain certainly does use as representational elements, are aptly
described as such "suspicious coincidences."

Figure 1.1 The top line shows the prominent events from the big bang to the present day
on a linear time scale. The next scale shows the prominent events in the last 1/100th of
this enlarged 100 times, and so on for the next two scales. The final scale also enlarges the
final 1/100th, but places 1990 at the center. The shaded area under the curve gives total
human population estimated from Thatcher (1983). It has currently reached a density of 25
per square mile, averaged over the whole surface of the earth, oceans, arctic wastes, and all.
Human culture is very recent and has been accompanied by an explosive growth in human
population.
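
The criterion in question (2), conjunctions that occur often but would be expected only rarely by chance, can be made concrete with a small sketch. It is an illustration only, not the measurement Tolhurst performed: the function name `coincidence_ratios` and the toy data are assumptions introduced here. For each pair of binary elements it divides the observed joint frequency by the product of the marginal frequencies, which is what independence would predict.

```python
from itertools import combinations

def coincidence_ratios(samples):
    """For each pair of binary elements, return the observed joint frequency
    divided by the frequency expected if the two were independent."""
    n = len(samples)
    width = len(samples[0])
    # Marginal probability that each element is active.
    p = [sum(s[i] for s in samples) / n for i in range(width)]
    ratios = {}
    for i, j in combinations(range(width), 2):
        joint = sum(1 for s in samples if s[i] and s[j]) / n
        expected = p[i] * p[j]
        if expected > 0:
            ratios[(i, j)] = joint / expected
    return ratios

# Hypothetical toy data: elements 0 and 1 always fire together,
# while element 2 is statistically independent of both.
samples = [
    (1, 1, 0), (1, 1, 1), (0, 0, 0), (0, 0, 1),
    (0, 0, 0), (0, 0, 1), (0, 0, 0), (0, 0, 1),
]
ratios = coincidence_ratios(samples)
print(ratios)
```

A ratio well above 1 marks a "suspicious coincidence" worth a representational element of its own; a ratio near 1 means the conjunction carries nothing beyond its separate constituents.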

Inadequacy  of  the  Historical  View  of  Cortical  Evolution 

A historical explanation is usually given for our large brains and intellectual
domination of the world. Our earliest mammalian ancestors, it is
said,  were  ground-living  creatures  with  smell  as  their  dominant  sense,  but 
when  they  colonized  the  trees  smell  became  less  useful,  whereas  sight, 
sound,  and  muscular  dexterity  became  more  important.  Smell  formerly 
dominated  the  forebrain,  and  when  it  lost  its  importance  this  freed  the 
protocortex  for  other  purposes,  so  the  small  regions  previously  devoted 
to  vision,  hearing,  touch,  and  muscular  movement  rapidly  expanded  and 
thus formed the primitive neocortex. This organization enabled our ancestors
to expand into new ecological niches, and the improved associative
power  of  the  new  organ  gave  us  the  intellectual  advantages,  including 
versatility,  insight,  and  adaptability,  that  have  enabled  us  to  dominate  the 
world.  The  outline  of  this  view  dates  back  at  least  to  Elliot  Smith  (1924), 
but  many  details  have  been  added  (Allman  1987;  Jerison  1991). 

This  crude  sketch  does  not  do  justice  to  several  nice  aspects  of  this  story, 
but  it  is  basically  unsatisfactory  because  the  neocortex  appears  to  have  led 
the  evolution  of  mammals,  primates,  and  man,  and  not  to  have  followed 
passively  as  a  result  of  a  series  of  historical  accidents.  What  selective 
advantage  could  the  forebrain,  or  future  neocortex,  provide  that  other 
brain  regions  could  not?  What  is  meant  by  improved  associative  power, 
and  why  should  an  organ  formerly  dominated  by  smell  have  it?  These 
are  the  interesting  questions,  and  the  history  of  man's  evolution  is  not  the 
right  place  to  look  for  the  answers. 

The  supposed  origin  of  neocortex  in  a  region  specializing  in  olfaction 
is  interesting,  but  that  is  a  difficult  fact  to  interpret  and  would  not  make 
a  good  starting  point.  Instead  we  look  at  the  account  of  neocortex  that 
neurophysiology  has  given  us. 

Inadequacy  of  the  Neurophysiological  Account  of  Neocortex 

As many have recognized, the view of cortical function derived from neurophysiology
is unsatisfactory. Something like 60% of the monkey cortex
seems to be directly connected with vision (Van Essen and Maunsell 1980),
but so far no one has really tried to understand how it does anything but
represent the current visual scene. The same is true in other modalities: it
is the function of representing the current input that has received attention.
But we do not have a homunculus to look at these representations: our cortex
and associated structures form the representation, look at it, analyze
it, store results about it, use it, and continuously add to it. Animals learn
almost everything they know through their senses, and academic knowledge
apart, the same is true for us; but the means of acquisition, storage,
and  utilization  of  this  knowledge  have  been  little  thought  about  or  studied, 
and  I  think  it  is  time  to  accept  that  the  neocortex  must  do  more  for  us  than 
merely  represent  the  current  scene. 

A  Hypothesis  about  the  Computational  Goal  of  Neocortex 

The outstanding question is: How does neocortex give the great selective
advantage that must lie behind our rapid evolution? Herrick (1926)
said that the cerebral cortex provided the "filing cabinets of the central
executive," and he also called it the "organ of correlation." Jerison (1991)
summarizes its role as "knowing about the world." The importance for
higher mental function of forming working models and cognitive maps
of the world was pointed out by Craik (1943) and Tolman (1948), and,
although heretical at that time, these ideas from psychology fit the view
from comparative anatomy very well and are now widely accepted. Tolman
was thinking primarily of representing the geographic layout of the
world, and Craik's working models imitated the dynamics of interactions
in the material world, but as Humphrey (1976) pointed out, the interactions
between people are the most complex and important things we have
to  understand,  and  the  cortex  is  therefore  likely  to  be  much  concerned  with 
this  aspect. 

Thus the hypothesis is that the cerebral cortex confers skill in deriving
useful knowledge about the material and social world from the uncertain
evidence of our senses, stores this knowledge, and gives access to it
when required. This extremely complex and difficult task specifies a definite
computational goal for neocortex, providing a useful framework for
thinking  about  its  structure,  organization,  and  function.  First  consider  the 
problem  of  acquiring  such  knowledge. 

Information  and  Knowledge 

We  understand  the  problem  of  acquiring  knowledge  of  the  world  better 
now than in Herrick's day. It is not a matter of simply recording or videotaping
the succession of messages from the outside world that our senses
provide, but is a much more analytic process. For present purposes it is
convenient to distinguish two aspects of the stream of sensory data, information
and knowledge. Information is unpredictable, both from previous
parts of the stream of data and from other parts of the current stream. As
Shannon told us (Shannon and Weaver 1949), all this genuine information
can in principle be encoded onto a channel of much lower capacity than
that which is required for the physical data transduced by the sense organs.
The structure and regularity in the stream of data are redundancy in
terms of information theory, but this part constitutes the knowledge that the
neocortex must continuously acquire and use. Both parts, information and
knowledge, are important to the brain: it must recognize the structure and
regularity both to distinguish what is new information and to make useful
interpretations and predictions about the world. Finding the structure
and regularity is the analytic part of dealing with the succession of sensory
impressions that the brain receives, and this is the part that the neocortex
performs better than other brain structures according to the current
hypothesis:  it  gives  meaning  to  the  stream  of  sensory  data. 
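
The split between information and redundancy can be illustrated numerically. The sketch below is an assumption-laden toy, not anything taken from the chapter: it treats a short symbol string as the sensory stream and estimates how many bits per symbol remain unpredictable once the previous symbol is known. The helper names and the example stream are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(counts):
    """Shannon entropy in bits of an empirical distribution given as counts."""
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

def conditional_entropy(stream):
    """Estimate H(X_t | X_{t-1}) in bits from adjacent symbol pairs,
    using H(pair) - H(previous symbol)."""
    pairs = Counter(zip(stream, stream[1:]))
    prev = Counter(stream[:-1])
    return entropy(pairs) - entropy(prev)

# Hypothetical stream: a perfectly regular alternation of two symbols.
stream = "ABABABABABABABAB"
raw_bits = log2(len(set(stream)))        # capacity used per symbol (needs >= 2 symbols)
info_bits = conditional_entropy(stream)  # unpredictable part per symbol
redundancy = 1 - info_bits / raw_bits
print(raw_bits, info_bits, redundancy)
```

Each symbol of the alternating stream occupies one bit of raw capacity, yet none of it is unpredictable given the preceding symbol, so the stream is entirely redundancy in the information-theoretic sense and entirely "knowledge" in the sense used here.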

The  Salience  of  Structure 

Look  for  a  moment  at  the  top  left  part  of  figure  1.2.  It  consists  of  a  random 
array  of  dots.  Compare  it  with  the  top  right,  where  each  dot  has  been 
paired  at  a  position  symmetric  about  the  center  line.  The  random  parts  of 
these  two  figures  are  identical,  but  this  is  not  obvious.  It  is  the  symmetric 
structure  on  the  right  that  leaps  to  the  eye,  while  the  structureless  array 
means  nothing  to  us — unless  we  look  at  it  long  enough  and  start  to  impose 
structure  on  it,  such  as  faces  or  other  imaginary  forms.  In  the  lower  two 
figures  the  structure  resulting  from  other  pairing  rules  stands  out  equally 
clearly,  and  it  is  pretty  obvious  what  these  pairing  rules  are.  However,  it  is 
not  at  all  obvious  that  a  pairing  rule  is  solely  responsible  for  the  structure 
seen;  it  is  hard  to  believe  that  the  vivid  streaks  and  swirls  result  just  from 
pairs  of  dots,  with  no  longer  concatenations,  but  that  is  the  case  (for  the 
first  description  of  these  figures  see  Glass  1969). 
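
The pairing rules are simple enough to state as code. The following is a minimal sketch of how such dot patterns might be generated, assuming a unit square centered on the origin and arbitrarily chosen displacement sizes; the function names and all numerical parameters are illustrative assumptions, not values from Glass (1969) or from figure 1.2.

```python
import math
import random

def random_dots(n, seed=0):
    """n dots placed uniformly at random in the square [-1, 1] x [-1, 1]."""
    rng = random.Random(seed)
    return [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]

def mirror(dot):
    """Partner reflected about the vertical midline."""
    x, y = dot
    return (-x, y)

def translate(dot, dx=-0.05, dy=0.05):
    """Partner displaced up and to the left (y increases upward)."""
    x, y = dot
    return (x + dx, y + dy)

def spiral(dot, radial=1.05, angle=0.05):
    """Partner displaced radially (scaling) and tangentially (rotation)."""
    x, y = dot
    c, s = math.cos(angle), math.sin(angle)
    return (radial * (x * c - y * s), radial * (x * s + y * c))

def paired_pattern(dots, rule):
    """The original dots followed by one partner per dot; no longer chains."""
    return dots + [rule(d) for d in dots]

dots = random_dots(200)
patterns = {name: paired_pattern(dots, rule)
            for name, rule in [("mirror", mirror),
                               ("translate", translate),
                               ("spiral", spiral)]}
```

Every pattern contains the same 200 random positions plus exactly one partner per dot, which is the point made above: the streaks and swirls arise from pairs alone.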

One  can  detect  structures  of  this  sort  when  they  are  overlaid  by  a  huge 
number  of  completely  randomly  placed  dots  (Maloney  et  al.  1987),  so  the 
suggestion  is  that  our  perceptual  system  grabs  simple  examples  of  world- 
knowledge  of  this  sort  and  uses  them  to  construct  its  representation  of  the 
world.  It  is  plausible  to  suppose  that  mirror  symmetry  and  translational 
symmetry  are  so  abundant  in  our  sensory  diet  that  an  animal  is  certain  to 
encounter  them;  hence  mechanisms  for  detecting  these  forms  of  structure 
will  always  prove  useful,  and  their  universal  provision  by  ontogenetic 
mechanisms  has  selective  advantage. 

Figure 1.2 (A) An array of 200 randomly positioned dots. (B) Each dot in the array of A has
been paired at a position mirror-symmetric about the vertical midline. (C) Each dot has been
paired at a position up and to the left of the original position. (D) Each dot has been paired
at a position displaced radially and tangentially from the center. It is the structure that leaps
to the eye, though this is technically a form of redundancy; the random positions of the dots
contain much more information, but the eye gives them less prominence.

But much of the knowledge we acquire is not like this at all; it consists
of arbitrary forms whose regularity or structure results simply from the
fact that they occur often, or are repeatedly associated with reward and
gratification. Each individual system has to discover these for itself, and we
spend  our  lives  finding,  storing,  and  using  knowledge  of  these  regularities 
in our sensory diet. They range from the often-repeated experience of our
parents' smell, voice, and appearance, through the geographic details of
our  environment  and  the  acoustical  specificities  of  our  language,  to  the 
customs,  myths,  and  true  knowledge  of  our  culture.  Much  of  this  process 
of  acquisition  is  fostered  by  teaching,  but  each  individual  brain  has  to  do 
a  lot  of  discovering  for  itself. 

So  far  we  have  been  considering  the  goal  of  the  computations  the  cortex 
performs on the current sensory input, arguing that it prepares a representation
suitable for discovering associative structure, and that this process
entails storing world knowledge. But each cortex has not only its own
experience and history but also an evolutionary history. Evolution results
from natural selection acting on variants produced genetically, so perhaps
the pattern of variants produced by the cortex has enabled it to excel in the
evolutionary  acquisition  of  world  knowledge. 

Evolutionary  Learning  and  Neocortex 

Most  people  will  accept  the  fact  that  there  is  such  a  thing  as  inherited 
knowledge  of  the  world.  Many  of  the  most  striking  examples  are  found 
in  insects — for  example,  the  yucca  moth  could  not  fertilize  the  yucca  plant 
and  use  its  ovaries  as  incubators  for  its  own  eggs  without  such  knowledge, 
nor  could  the  ichneumon  select  a  particular  species  of  caterpillar  to  lay  its 
eggs  in.  But  it  occurs  in  mammals  too — the  specialized  skills  of  a  retriever 
are  quite  different  from  those  of  a  sheepdog  or  a  greyhound — and  no  one 
doubts  that  these  skills  have  a  large  inherited  component. 

Now  a  characteristic  cannot  play  an  important  role  in  the  evolution  of  a 
species  unless  it  is  controlled  genetically  and  subject  to  genetic  variability. 
Therefore  the  view  that  neocortex  is  responsible  for  our  rapid  evolution 
implies  that  its  function  must  be  controlled  genetically,  for  otherwise  it 
could  not  have  brought  us  to  the  position  we  are  in.  The  full  hypothesis 
must  therefore  be  that  the  neocortex  gives  us  useful  knowledge  of  the 
world  in  two  ways:  not  only  does  it  discover  the  structure  of  its  world 
by  experience  during  its  lifetime,  but  it  also  has  mechanisms,  adapted 
through  the  process  of  genetic  selection,  that  confer  skills  for  doing  this. 
These  mechanisms  and  skills  are  sometimes  highly  specialized  and  amount 
to  inherited  knowledge  of  the  world.  On  this  view  both  extreme  schools  of 
thought  about  the  origin  of  our  mental  powers  are  correct:  the  neocortex 
acquires  knowledge  of  the  world  by  nature  as  well  as  by  nurture,  but 
these  methods  work  toward  the  same  end  rather  than  being  the  mutually 
exclusive  alternatives  that  we  tend  to  think.  For  this  reason  they  can  be 
considered  together  when  trying  to  define  the  computational  goal  of  the 
neocortex. 

REPRESENTATIONS  DESIGNED  FOR  KNOWLEDGE  ACQUISITION 

What  we  know  of  the  neurophysiology  of  neocortex  does  not  at  first  suggest 
that  it  is  concerned  with  acquiring,  storing,  and  utilizing  knowledge  of  the 
world.  Instead,  it  seems  to  form  representations  of  the  current  scene  in  the 
sensory  areas,  and  perhaps  the  motor  area  could  be  thought  of  as  forming 
a representation of current motor actions. But this representational function
does not necessarily conflict with the hypothesis about acquisition of
knowledge. Different types of representation are suitable for different purposes,
and the cortical representation may be one that is specially adapted
to facilitate the learning of new associations. It turns out that storage and
utilization of knowledge about the world are necessary to form such a representation,
so the comparative anatomists' view that neocortex provides
the  filing  cabinets  of  the  central  executive  could  be  nicely  reconciled  with 
the  neurophysiological  facts  about  representation. 

Figure  1.3  Flow  diagram  for  perception  suggested  by  the  hypothesis  that  the  cerebral  cortex 
creates  a  representation  of  the  current  sensory  scene  that  facilitates  the  identification  of  new 
associations.  To  separate  new  information  from  knowledge  (i.e.,  redundancy)  there  must  already 
be  a  store  of  the  known  structure  and  regularities  found  in  sensory  inputs,  and  this  must  be 
used  to  form  a  model  that  accounts  for  as  much  as  possible  of  the  current  sensory  input; 
what  this  accounts  for  is  then  removed  from  the  input  representation.  Though  we  think  we 
experience  sensory  messages  directly,  what  we  see  corresponds  better  to  the  contents  of  the 
heavily  outlined  boxes. 


Figure  1.3  illustrates  a  flow  diagram  for  perception  according  to  this 
hypothesis.  The  sensory  messages  are  combined  with  a  store  of  knowledge 
of  the  world  to  find  the  best  model  of  the  current  sensory  scene.  This  model 
is  then  compared  with  the  sensory  messages  being  received,  and  those 
parts  that  match  are  removed.  The  residue  represents  the  part  of  the  current 
sensory  input  that  is  unaccounted  for  by  preexisting  knowledge.  Ideally 
this  would  correspond  to  new  information,  together  of  course  with  noise 
of  random  origin.  We  can  be  aware  of  this  residue,  but  the  subjectively 
salient  and  objectively  useful  parts  of  the  sensory  flow  consist  mainly  of 
items  that  have  been  accounted  for  (i.e.,  items  that  have  been  successfully 
modeled),  and  also  new  regularities  or  structure  in  the  parts  that  the  current 
model  does  not  account  for.  This  flow  diagram  has  features  related  to  the 
"matching  response"  of  MacKay  (1955),  the  thalamic  "active  blackboard" 
of  Mumford  (1991, 1992),  and  the  adjusting  feedback  of  Daugman  (1988) 
and  Pece  (1992). 
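
One pass through this loop can be sketched in a few lines. The version below is a deliberately crude stand-in under stated assumptions: stored knowledge is a short list of template vectors, the "best model" is the single template that, after least-squares scaling, accounts for the most signal energy, and "removing what matches" is plain subtraction. None of these choices comes from the chapter, and the names `best_model` and `perceive` are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def best_model(signal, templates):
    """Return the scaled stored pattern that accounts for most of the signal."""
    best, best_explained = [0.0] * len(signal), 0.0
    for t in templates:
        gain = dot(signal, t) / dot(t, t)    # least-squares scale factor
        explained = gain * gain * dot(t, t)  # signal energy accounted for
        if explained > best_explained:
            best, best_explained = [gain * x for x in t], explained
    return best

def perceive(signal, templates):
    """Model the input from stored knowledge, then return model and residue."""
    model = best_model(signal, templates)
    residue = [s - m for s, m in zip(signal, model)]
    return model, residue

# Hypothetical stored regularities: a uniform level and an alternation.
templates = [[1, 1, 1, 1], [1, -1, 1, -1]]
# A familiar uniform pattern with one unexpected event at position 2.
signal = [2.0, 2.0, 5.0, 2.0]
model, residue = perceive(signal, templates)
print(model, residue)
```

The residue is largest at the position of the unexpected event, which is the sense in which discounting the expected makes what is new stand out. Feeding the residue back in as the next signal would give one way of repeating the cycle.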

Changing  the  Code  Stores  Associative  Structure 

Figure 1.4 The image on the left was "whitened" by making the power spectrum of the
spatial Fourier transform level, thus producing the image on the right, which has a much
narrowed autocorrelation function. The types of statistical structure that occur at the borders
of objects survive whitening and can be more easily analyzed and detected in the absence of
the autocorrelations that whitening removes. (From Tolhurst and Barlow 1993)

The suggested operation can be thought of in a different way, as a recoding
to reduce redundancy. The presence of one type of associative structure
in a body of data makes it more difficult to detect another type, so to
detect  this  new  type  it  is  desirable  to  recode  the  messages  to  eliminate 
the  first  type.  It  is  certainly  very  often  the  case  that  removing  a  known 
type  of  associative  structure  makes  it  easier  to  identify  a  new  type,  and 
figure  1.4  provides  an  illustration.  The  left  part  is  a  normal  image  and 
thus  has  an  autocorrelation  function  that  extends  over  a  large  fraction  of 
the  whole  image.  The  counterpart  of  this  is  the  great  excess  of  low  spatial 
frequencies  in  the  power  spectrum  of  the  Fourier  transform,  and  these 
can  be  removed  by  applying  an  inverse  spatial  filter  to  make  the  power 
spectrum  level.  This  process  of  "whitening"  eliminates  the  correlations, 
estimated  over  the  whole  image,  between  pairs  of  points  with  any  fixed 
separation,  and  the  result  is  shown  in  the  right  image.  It  is  clear  that  the 
higher  order  structures,  whatever  they  are,  that  correspond  to  borders  and 
edges  survive  and  can  be  more  easily  examined  in  this  image. 
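
Whitening can be demonstrated on a one-dimensional signal with nothing more than a discrete Fourier transform. The sketch below is an illustration under assumptions, not the procedure used for figure 1.4: it works on a single hypothetical "image row" (a random walk, which has long-range correlations), uses a naive O(n^2) transform to stay dependency-free, and flattens the spectrum by dividing each Fourier component by its own magnitude while keeping its phase.

```python
import cmath
import random

def dft(x, inverse=False):
    """Naive discrete Fourier transform; the inverse includes the 1/n factor."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
               for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def whiten(signal, eps=1e-12):
    """Make the amplitude spectrum level while keeping each component's phase."""
    spectrum = dft(signal)
    flat = [c / max(abs(c), eps) for c in spectrum]
    return [v.real for v in dft(flat, inverse=True)]

def circular_autocorrelation(x, lag):
    """Sum of products of points separated by `lag`, with wrap-around."""
    n = len(x)
    return sum(x[t] * x[(t + lag) % n] for t in range(n))

# Hypothetical image row: a random walk, so nearby samples are similar.
rng = random.Random(1)
row, level = [], 0.0
for _ in range(64):
    level += rng.gauss(0, 1)
    row.append(level)

white = whiten(row)
```

After whitening, the circular autocorrelation is concentrated at zero lag and vanishes, up to rounding, at every other separation. The phases are untouched, and it is in the phases that localized structure such as edges resides, which is why such structure survives the operation.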

In  outline  then,  the  idea  is  that  associative  structure  one  already  knows 
about  should  be  removed  from  the  data  stream  to  make  it  easier  to  detect 
new  associative  structure.  Knowledge  of  the  old  associations  should  be 
used  to  change  the  code  and  thus  modify  the  representation  so  that  these 
old  associations  are  no  longer  present.  This  is  the  idea  of  recoding  to 
reduce  redundancy  (Barlow  1959;  Watanabe  1960)  or,  if  you  like  a  simpler 
analogy,  it  is  like  calculating  the  regression  that  corresponds  to  an  already 
recognized  correlation  to  make  it  easier  to  find  further  relationships  in  the 
residuals.  Of  course  the  modifications  will  not  generally  be  as  simple  as 
subtracting  out  an  expected  regression,  but  a  set  of  modifications  aimed  at 
accounting for and reducing the known structure in a set of images would
constitute  stored  knowledge  about  those  images. 
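The regression analogy can be made concrete with a small numerical sketch. Everything in it is an illustrative assumption: the data are synthetic, and the names `known`, `hidden`, and `signal` are hypothetical. A strong, already recognized dependence on `known` masks a weaker dependence on `hidden`; once the fitted regression on `known` is subtracted, the residuals correlate strongly with `hidden`.

```python
import random


def mean(values):
    return sum(values) / len(values)


def correlation(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def remove_regression(y, x):
    """Subtract the least-squares regression of y on x; return the residuals."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return [b - my - slope * (a - mx) for a, b in zip(x, y)]


rng = random.Random(0)
known = [rng.gauss(0, 1) for _ in range(500)]   # structure already recognized
hidden = [rng.gauss(0, 1) for _ in range(500)]  # structure still to be found
signal = [5.0 * k + 0.5 * h for k, h in zip(known, hidden)]

before = correlation(signal, hidden)
after = correlation(remove_regression(signal, known), hidden)
print(f"correlation with hidden cause: before {before:.2f}, after {after:.2f}")
```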

To  an  outside  observer,  a  system  performing  these  operations  would 
look  like  one  that  constructed  Craik's  working  models  (Craik  1943)  and 
Tolman's  cognitive  maps  of  the  environment  (Tolman  1948),  for  it  would 
show  evidence  of  finding  and  using  the  associative  structure  that  underlies 
such  models  and  maps.  Obviously  this  store  of  knowledge  has  many  other 
potential  uses,  particularly  in  the  processes  of  imagination  and  recall  where 
we  experiment  and  play  with  what  we  know.  Possibly  it  could  be  made 
accessible  in  the  absence  of  sensory  input  by  lowering  neural  thresholds 
in  the  box  marked  "stored  knowledge  about  the  environment"  in  figure 
1.3,  but  this  possibility  cannot  be  pursued  here.  Using  this  knowledge  to 
discount  the  expected  in  the  representation  of  the  current  scene  would  have 
enormous  selective  advantage  by  improving  learning  and  the  acquisition 
of  new  knowledge,  though  it  certainly  seems  wasteful  not  to  use  it  for 
imagination  and  recall  as  well. 

Anything  that  improves  the  appropriateness  and  speed  of  learning  must 
have  immense  competitive  advantage,  and  the  main  point  about  this  pro¬ 
posal  is  that  it  would  explain  the  enormous  selective  advantage  of  the 
neocortex.  Such  an  advantage,  together  with  appropriate  genetic  variabil¬ 
ity,  could  in  turn  account  for  its  rapid  evolution  and  the  subsequent  growth 
of  our  species  to  its  dominant  position  in  the  world. 

Although  these  notions  do  not  obviously  follow  from  the  neurophysio¬ 
logical  facts,  I  think  suggestive  evidence  in  support  can  be  found  from  the 
changes  in  neural  connectivity  that  occur  in  the  sensitive  period  early  in 
the life of cats and monkeys (Hubel and Wiesel 1970; Movshon and Van
Sluyters  1981),  and  in  the  known  phenomena  of  pattern-selective  adap¬ 
tation  discussed  elsewhere  (Barlow  1990,  1991).  There  are  aspects  of  the 
evolution  and  neurophysiology  of  the  cortex  that  we  certainly  do  not  yet 
understand  properly,  and  the  new  hypothesis  can  give  us  a  fresh  viewpoint 
if  we  examine  what  it  requires  in  more  detail. 

Acquiring  Knowledge  from  Representations  of  Features 

Acquiring  knowledge  means  finding  out  about  the  regularities  and  pat¬ 
terns  in  the  sensory  input.  It's  a  vast  task  to  determine  the  associational 
structure  of  the  continuous  stream  of  sensory  messages  that  we  receive, 
and  table  1.1  lists  some  of  the  requirements,  starting  with  the  point  above 
about  the  desirability  of  removing  evidence  for  the  associations  you  al¬ 
ready  know  about.  The  next  items  have  been  discussed  before  (Barlow 
1991)  but  will  be  summarized  below. 

Suppose  that  the  representation  of  the  current  scene  consists  of  reports 
of  features,  of  which  there  can  be  a  wide  variety.  For  instance,  one  of  them 
might  be  a  point  in  the  image  having  a  luminance  value  above  the  mean  for 
the  neighborhood  of  that  point,  and  this  would  correspond  approximately 
to  the  feature  that  causes  the  firing  of  an  on-center  ganglion  cell  in  the  retina. 

Table 1.1 What would make it easier to identify new associations?

Remove evidence of the associations you      To facilitate detecting new ones
already know about

Make available the probabilities of the      To determine chance expectations
features currently present

Choose features that occur independently     To determine chance expectations
of each other in the normal environment      of combinations of them

Choose "suspicious coincidences" as          To reduce redundancy and ensure
features                                     appropriate generalization


Or  it  might  be  the  occurrence  of  a  visual  pattern  resembling  a  monkey's 
face,  which  would  correspond  to  the  occurrence  of  the  trigger  feature  of  a 
so-called  face  cell  in  inferotemporal  cortex.  Thus  almost  any  representation 
one  can  imagine  can  be  described  as  reporting  the  occurrence  of  features. 

There  must  be  many  levels  in  the  actual  representational  system  in  the 
brain,  and  more  complex  features  are  presumably  represented  at  higher 
levels.  The  first  item  in  table  1.1  suggests  that  recoding  to  take  account 
of  identified  regularities  in  sensory  messages  will  be  an  important  step  in 
progressing  to  higher  levels  in  the  perceptual  system.  But  for  present  pur¬ 
poses  let  us  consider  a  single  level  and  examine  what  is  needed  to  identify  a 
new  association.  We  hope  that  the  repetition  of  this  one  operation  may  lead 
to  a  system  that  identifies  the  complex  associations  that  we  undoubtedly 
use  all  the  time. 

The  Need  for  Prior  Probabilities 

To  identify  new  associations  a  representation  must  do  more  than  just  report 
the  occurrence  of  features:  it  must  also  signal  the  unexpectedness  of  the 
features  reported,  or  at  least  make  this  information  immediately  accessible. 
This  might  be  done  by  adjusting  the  threshold  for  a  unit  so  that,  averaged 
over  a  long  period,  it  fires  once  in  a  particular  period;  when  it  fires,  it  then 
signals  an  event  that  has  a  probability  of  occurring  once  in  that  period. 
Alternatively,  the  number  of  impulses  in  the  volley  signaling  an  event 
might be an inverse function of its probability such as $-\log p$. Either
of  these  would  appear  as  forms  of  habituation,  which  is  of  course  often 
observed  in  sensory  systems. 
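A minimal sketch of the second scheme, in which the signal grows as the event becomes less probable. The base of the logarithm is not specified in the text, so base 2 (bits) is an assumption here, and the function names are hypothetical.

```python
import math


def estimate_probability(occurrences, observations):
    """Estimate a feature's probability from its recent rate of occurrence."""
    return occurrences / observations


def surprisal_bits(probability):
    """Unexpectedness of an event, -log2(p), in bits (base 2 is an assumption)."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


# A feature seen once in 8 observations is signaled more strongly
# than one seen every other observation.
print(surprisal_bits(estimate_probability(1, 8)))  # rarer event
print(surprisal_bits(estimate_probability(1, 2)))  # commoner event
```

A unit using such a rule would respond less and less to a feature as it becomes common, which is the habituation-like behavior mentioned above.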

The  reason  prior  probabilities  are  needed  is  obvious:  to  show  that  two 
features  are  associated  one  must  show  that  they  occur  together  at  a  rate  dif¬ 
ferent  from  that  expected  by  chance,  and  to  calculate  this  expected  rate  one 
needs  to  know  the  expected  rates  of  the  constituent  individual  features.  Of 
course  one  also  needs  to  know  how  often  the  features  occur  together,  but 
one  can  justifiably  regard  this  as  a  requirement  of  the  associative  mecha¬ 
nism  itself,  while  it  seems  more  natural  to  suppose  that  the  representation 
is  responsible  for  storage  and  access  to  the  rates  of  the  individual  features. 

Two points may need clarifying. First, we are assuming that the probability
of occurrence of a single feature can be estimated from its rate of
occurrence  over  some  period  in  the  recent  past.  This  would  not  always  be 
justified,  but  in  some  cases  it  will  be  and  to  let  the  argument  proceed  let 
us  assume  it  is.  Second,  the  features  we  shall  be  dealing  with  will  usually 
have  prior  probabilities  well  below  half;  this  means  that  the  predicted  rate 
of  occurrence  of  joint  features  will  be  low,  and  their  expected  number  may 
be  close  to  zero.  Under  these  circumstances  it  becomes  difficult  to  establish 
a  negative  association,  and  one  must  therefore  look  for  joint  features  that 
occur  more  often  than  expected  by  chance.  That  is  why  they  were  called 
suspicious  coincidences  or  cliches  (Phillips  et  al.  1984;  Barlow  1985),  but 
the  basic  property  is  their  nonaccidental  nature. 
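The arithmetic behind a suspicious coincidence can be sketched as follows. The counts are invented for illustration and the function names are hypothetical; the calculation simply compares the observed number of joint occurrences with the number expected if the two features occurred independently at their individual rates.

```python
def expected_coincidences(count_a, count_b, n_observations):
    """Joint occurrences expected by chance if A and B occur independently."""
    return count_a * count_b / n_observations


def coincidence_ratio(count_a, count_b, count_both, n_observations):
    """Observed joint count divided by the chance expectation.

    Values well above 1 mark a candidate "suspicious coincidence".
    """
    return count_both / expected_coincidences(count_a, count_b, n_observations)


# Two features with priors of 0.1 and 0.05 over 1000 observations.
print(expected_coincidences(100, 50, 1000))   # chance expectation
print(coincidence_ratio(100, 50, 20, 1000))   # observed / expected

# With priors of 0.01 each, fewer than one coincidence is expected, so a
# deficit (negative association) cannot show up in a sample of this size.
print(expected_coincidences(10, 10, 1000))
```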

The  Need  for  Independence 

Knowing  and  using  the  unexpectedness  of  features  seem  unavoidable  for 
efficient  associative  learning,  but  there  is  another  highly  desirable  property 
when  detecting  new  associations,  namely  statistical  independence,  in  the 
environment  to  which  the  system  is  adapted,  of  the  features  represented. 
Even  in  a  simple  case,  such  as  finding  a  new  association  between  a  special 
occurrence  such  as  reinforcement  and  an  individual  element  of  the  repre¬ 
sentation,  one  would  start  by  assuming  they  were  independent  to  estimate 
the  expected  number  of  coincidences.  The  alternative  would  be  to  take  ac¬ 
count  of  the  known  correlations,  but  this  would  become  difficult  when 
detecting  associations  between  arbitrary  pairs  of  elements,  and  virtually 
impossible  if  one  wished  to  find  an  association  with  some  logical  function 
of  a  group  of  elements.  In  that  case  one  either  has  to  know  the  associational 
structure  within  that  group,  or  else  one  must  again  assume  independence, 
and  if  one  is  going  to  do  the  latter  it  is  important  to  make  sure  that  the 
events  represented  are  in  fact  as  nearly  independent  as  possible.  While  it 
is  plausible  for  a  representation  to  store  the  rate  of  occurrence  of  its  individ¬ 
ual  elements,  one  cannot  suppose  that  it  stores  the  associational  structure 
of  arbitrary  groups  of  such  elements. 

Are  we  actually  able  to  detect  new  associations  with  logical  functions 
of  representational  elements?  For  simple  functions,  surely  we  can,  and  so 
can  most  animals.  We  learn  to  stop  at  red  traffic  lights  and  not  at  green 
ones,  for  example.  In  this  case  one  might  suppose  that  there  are  differ¬ 
ent  representational  elements  for  red  and  green  lights,  but  it  would  be  a 
great  restriction  on  the  utility  of  a  representation  if  this  was  always  neces¬ 
sary  before  separate  associations  could  be  formed.  Harris  (1980)  brought 
this  out  very  nicely  when  discussing  contingent  adaptation,  for  he  noted 
that  almost  any  contingency  that  had  ever  been  tested  seemed  to  produce 
adaptive effects. How could this be, he said, if contingent adaptation requires
neurons specifically sensitive to each contingency? We might have
neurons signaling yellowness, and perhaps Volkswagens, but surely we cannot
have neurons reserved for signaling yellow Volkswagens! This problem
will be considered again, but the advantage of distributed, as opposed
to  grandmother-cell,  representations  results  from  their  supposed  ability 
to  utilize  the  vast  number  of  combinations  of  active  elements,  and  this 
advantage  would  vanish  if,  as  a  result  of  their  prior  probabilities  being 
unavailable  or  grossly  misleading,  one  could  not  form  associations  with 
these  combinations  efficiently. 

Following  the  idea  that  one  should  discount  the  expected,  a  possible 
course  of  action  would  be  to  devise  a  code  in  which  the  elements  occur 
as  nearly  as  possible  independently,  and  some  ways  of  doing  this  have 
been  suggested  elsewhere  (Barlow  1959, 1989;  Barlow  and  Foldiak  1989; 
Hentschel  and  Barlow  1991),  together  with  evidence  that  something  of 
the  sort  may  be  happening  (Barlow  1990).  As  already  pointed  out,  the 
codes  that  are  required  to  obtain  independence  embody  knowledge  about 
the  associational  structure  of  the  environment,  and  an  outside  observer 
watching  behavior  based  on  this  modified  representation  should  suspect 
that  some  kind  of  cognitive  map  or  working  model  of  the  environment  had 
been  constructed. 

Forms  of  Representation 

A  distributed  representation  is  one  in  which  the  features  that  can  be  uti¬ 
lized  effectively  for  further  processing  are  represented  by  combinations 
of  activity  of  the  elements,  rather  than  directly  by  the  activity  of  neurons 
or  elements  specifically  and  selectively  sensitive  to  each  of  these  features. 
The  7-bit  ASCII  code  provides  a  familiar  example  of  a  distributed  repre¬ 
sentation,  and  because  one  must  perform  a  logical  manipulation  on  the 
representational  elements  before  one  can  decide  if  the  represented  fea¬ 
ture  has  occurred,  we  say  that  they  represent  the  features  implicitly  or 
indirectly  rather  than  directly.  Tony  Gardner-Medwin  and  I  have  been 
exploring  a  limitation  on  the  use  of  implicit  representations  for  learning 
(Gardner-Medwin  and  Barlow  1992, 1994). 

The  limitation  arises  as  follows.  Consider  classical  conditioning,  where 
an  initially  neutral  sensory  feature  is  "reinforced"  by  being  presented  re¬ 
peatedly  in  conjunction  with  a  pleasant  reward  such  as  some  food  in  the 
mouth.  When  the  animal  has  identified  the  association,  it  uses  the  initially 
neutral  feature  to  predict  the  reward.  Now  to  determine  whether  there 
is  a  genuine  association  one  must  form  a  2x2  contingency  table  for  the 
feature  and  the  reward,  counting  the  numbers  in  each  box  of  this  table.  If 
the  feature  is  directly  represented  there  is  no  great  conceptual  difficulty  in 
obtaining  all  these  numbers:  assuming  that  knowledge  of  reinforcement  is 
available  everywhere,  then  local  mechanisms  at  the  element  can  estimate 
how  often  the  feature  and  reinforcement  occur  by  themselves,  how  often 
they  occur  together,  and  how  often  nothing  occurs.  A  calculation  equiv¬ 
alent  to  a  chi-squared  test  can  then  be  done  on  these  numbers  to  decide 
if  the  association  is  genuine.  In  the  case  of  implicitly  represented  features 
this  is  not  so  straightforward. 
The difficulty is that there is no point in the system where all the information
is available to estimate the necessary numbers. One can imagine
the  reinforcement  signal  being  available  at  all  the  elements  that  carry  the 
information  telling  one  that  the  feature  has  occurred,  but  one  of  these 
elements  by  itself  is  not  enough  to  determine  whether  that  feature  oc¬ 
curred  or  not:  one  must  evaluate  the  logical  function  using  all  the  ele¬ 
ments  before  one  knows  this.  One  can  postulate  an  element  that  does  this 
logical  decoding,  but  such  an  element  would  directly  represent  the  fea¬ 
ture  and  it  would  no  longer  be  only  implicitly  represented.  What  can  be 
done? 

Even  though  accurate  counts  of  the  required  numbers  are  not  available 
at any one spot, a relatively simple mechanism could collect together information
for an inaccurate estimate. In the logical representation of the
feature  there  will  be  some  elements  that  are  positively  correlated  with  the 
presence  of  the  feature,  and  others  that  are  negatively  correlated.  The 
appropriately  weighted  sum  of  the  activities  of  these  elements  will  give 
an  indication  of  the  presence  of  the  feature,  and  the  average  over  time 
of  this  measure  can  be  used  to  estimate  how  often  it  occurs.  We  know 
from  the  limitations  of  perceptrons  (Minsky  and  Papert  1969)  that  there 
are  many  logical  functions  that  this  method  will  be  incapable  of  detecting 
correctly,  but  we  also  know  there  are  many  cases  where  it  works  satis¬ 
factorily.  An  approximation  of  this  sort  actually  seems  to  be  involved  in 
most  learning  in  artificial  neural  networks.  We  have  been  trying  to  de¬ 
termine  for  what  types  of  representation  such  learning  will  be  reasonably 
fast  and  efficient,  and  under  what  conditions  it  is  bound  to  be  slow  and 
unreliable. 
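The reach and the limits of a weighted sum can be illustrated with a single threshold unit on two binary elements. This is a toy sketch under stated assumptions: the weight grid `GRID` and the function names are hypothetical, and an exhaustive search over a small grid stands in for learning. A conjunction can be detected by such a unit, whereas exclusive-or, the standard example of a function that is not linearly separable, cannot.

```python
from itertools import product


def threshold_unit(weights, threshold, inputs):
    """Fire (return 1) when the weighted sum of the inputs exceeds the threshold."""
    return int(sum(w * x for w, x in zip(weights, inputs)) > threshold)


def find_threshold_unit(target, grid):
    """Search a grid of weights and thresholds for a unit computing `target`.

    `target` maps each pair of binary inputs to 0 or 1. Returns the first
    (weights, threshold) that reproduces it, or None if the grid has none.
    """
    patterns = list(product((0, 1), repeat=2))
    for w1, w2, threshold in product(grid, repeat=3):
        if all(threshold_unit((w1, w2), threshold, p) == target[p] for p in patterns):
            return (w1, w2), threshold
    return None


GRID = [x / 2 for x in range(-6, 7)]  # -3.0, -2.5, ..., 3.0 (arbitrary choice)
CONJUNCTION = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
EXCLUSIVE_OR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

print("conjunction:", find_threshold_unit(CONJUNCTION, GRID))
print("exclusive-or:", find_threshold_unit(EXCLUSIVE_OR, GRID))
```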

Statistical  Efficiency  as  a  Measure  of  Performance 

To  assess  the  merit  of  one  representation  against  another  we  need  a  measure 
of  associational  performance,  and  we  have  used  the  statistical  efficiency 
defined  by  R.  A.  Fisher  (1925)  for  this  purpose.  To  make  any  statistical 
decision  up  to  a  required  standard  of  reliability  a  sample  of  a  certain  min¬ 
imum  size  is  necessary,  but  if  the  method  is  inefficient  a  larger  sample  will 
be  needed  to  obtain  the  same  standard  of  reliability.  Fisher's  efficiency  is 
simply  the  ratio  of  these  two  sample  sizes.  In  our  case  the  sample  size  is 
given  by  the  maximum  number  of  occurrences  or  coincidences  that  could 
occur  in  the  time  for  which  the  counts  were  made,  so  if  the  method  is  in¬ 
efficient  it  will  take  longer  to  determine  that  an  association  is  present,  or 
more  mistakes  will  be  made  if  a  decision  is  made  in  the  same  time.  It  is 
pretty  clear  that  speedy  and  reliable  learning  about  new  causative  factors 
in  the  environment  will  have  high  survival  value,  and  the  statistical  effi¬ 
ciency  attainable  in  a  particular  type  of  representation  gives  a  very  direct 
measure  of  how  useful  it  would  be  for  enabling  an  animal  to  detect  new 
associations  and  acquire  new  knowledge  of  the  world. 
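As a small worked illustration of the definition (the function names and numbers are hypothetical), efficiency is the ratio of the minimum sample size to the sample size a method actually needs, so a method with 50% efficiency needs twice as many observations, and hence about twice as long, to reach the same reliability.

```python
def statistical_efficiency(ideal_sample_size, actual_sample_size):
    """Fisher's efficiency: minimum sample needed / sample the method needs."""
    return ideal_sample_size / actual_sample_size


def observations_needed(ideal_sample_size, efficiency):
    """Sample an inefficient method needs to match the ideal method's reliability."""
    return ideal_sample_size / efficiency


print(statistical_efficiency(50, 100))  # efficiency of a method needing 100 vs 50
print(observations_needed(50, 0.5))     # sample needed at 50% efficiency
```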

Explicit Representation

So far we have referred to "directly represented features," where there are
selectively sensitive representational elements that respond when and
only  when  the  feature  is  present,  and  "implicitly  represented  features,"  for 
which  there  are  no  such  selectively  sensitive  elements  but  the  appropri¬ 
ate  logical  analysis  of  the  pattern  of  activity  in  the  whole  representation 
nonetheless  allows  one  to  decide  if  the  feature  is  present.  We  now  intro¬ 
duce  an  intermediate  type,  explicitly  represented  features,  for  which  we 
have  been  able  to  solve  the  problem  of  determining  the  statistical  efficiency 
for  detecting  associations. 

An  explicitly  represented  feature  is  one  whose  presence  can  be  deter¬ 
mined  by  a  simple  logical  operation  performed  on  a  subset  of  the  elements 
in  the  representation,  rather  than  on  the  whole  representation.  So  far  the 
simple  logical  operation  we  have  analyzed  is  the  presence  of  a  particular 
pattern  in  the  elements  of  the  specified  subset,  since  this  seems  both  the 
simplest  and  perhaps  the  most  interesting  case.  It  turns  out  that  inactive 
elements  carry  very  little  information  if  the  representation  is  reasonably 
sparse  (i.e.,  the  average  proportion  of  elements  active  at  any  one  time  is  low, 
say less than 10%), so one need consider only the active elements. Each of
these  directly  represents  a  different  feature,  so  the  occurrence  of  the  pattern 
corresponds  to  the  conjunction  or  joint  occurrence  of  certain  specific  fea¬ 
tures.  To  return  to  Harris's  example,  if  one  element  at  a  particular  point  in 
the  visual  field  directly  represented  Volkswagens,  and  another  element  at 
that  position  directly  represented  Yellowness,  Yellow  Volkswagens  would 
not  be  directly  represented,  but  they  would  be  explicitly  represented  by 
the  joint  occurrence  of  the  above  two  elements.  The  question  we  think  we 
have  answered  is  "Under  what  conditions  can  a  representation  in  which 
there  are  Yellowness  (Y)  units  and  Volkswagen  (V)  units,  but  no  Yellow 
Volkswagen  (YV)  units,  nonetheless  be  used  to  detect  efficiently  an  asso¬ 
ciation  with  Yellow  Volkswagens?"  The  answer  is,  however,  more  general 
than  this,  for  it  applies  to  multiple  conjunctions  and  patterns  in  subsets 
with  more  than  a  pair  of  active  elements. 
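The definition can be sketched in a few lines, assuming a representation described simply by the set of its currently active units. The unit labels and the function name are hypothetical. A feature is explicitly represented when its presence can be read off as the joint activity of a specified subset of units, with inactive units ignored.

```python
def explicitly_present(active_units, pattern):
    """True when every unit in `pattern` is among the currently active units.

    Inactive units are ignored, following the remark that they carry little
    information when the representation is sparse.
    """
    return set(pattern) <= set(active_units)


# There are Y units and V units but no dedicated YV unit.
YELLOW_VOLKSWAGEN = {"Y", "V"}

print(explicitly_present({"Y", "V", "moving"}, YELLOW_VOLKSWAGEN))  # yellow Volkswagen
print(explicitly_present({"Y", "round"}, YELLOW_VOLKSWAGEN))        # some other yellow thing
```

The same test applies unchanged to patterns of three or more units, which is the generality claimed at the end of the paragraph above.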

The  outline  of  the  analysis  is  as  follows.  To  determine  if  an  association  is 
present between a feature and reinforcement (R) one does a chi-squared test
on a 2 x 2 contingency table in which the feature (Y, V, or YV) is one of the
variables  and  reinforcement  (R)  is  the  other.  Because  of  the  sampling  errors 
in  the  numbers  in  such  a  table  the  result  will  be  variable,  and  this  variability 
determines  how  large  a  sample  is  required  before  one  can  confidently  assert 
that  an  association  is  present. 
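For a directly represented feature the test can be sketched as below. This is the textbook Pearson statistic for a 2 x 2 table without continuity correction, offered as an illustration rather than as the exact calculation used in the analysis; the counts and names are hypothetical.

```python
def chi_squared_2x2(both, feature_only, reward_only, neither):
    """Pearson chi-squared statistic for a 2 x 2 table (1 degree of freedom)."""
    n = both + feature_only + reward_only + neither
    row1, row2 = both + feature_only, reward_only + neither
    col1, col2 = both + reward_only, feature_only + neither
    denominator = row1 * row2 * col1 * col2
    if denominator == 0:
        raise ValueError("every row and column of the table must be non-empty")
    return n * (both * neither - feature_only * reward_only) ** 2 / denominator


CRITICAL_5_PERCENT = 3.841  # chi-squared with 1 d.f. at p = 0.05

# Feature and reinforcement: together 30 times, feature alone 20,
# reinforcement alone 10, neither 40.
statistic = chi_squared_2x2(30, 20, 10, 40)
print(statistic, statistic > CRITICAL_5_PERCENT)
```

Because the four counts are subject to sampling error, the statistic varies from sample to sample, and it is this variability that sets the sample size needed for a confident decision.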

If  there  are  no  YV  units  one  must  look  at  the  2x2  tables  for  Y  vs.  R  and  V 
vs.  R,  and  combine  the  results  to  assess  whether  YV  is  associated  with  R. 
Now  for  each  of  the  Y  vs.  R  and  V  vs.  R  tables  there  is  a  perturbing  factor: 
Yellowness  can  occur  with  reinforcement  even  if  there  is  no  Yellow  Volk¬ 
swagen  present,  and  likewise  for  nonyellow  Volkswagens.  These  intrusive 
extra occurrences will not bias the result if one knows the unexpectedness
of Y and V, but they will add to the variability of the two subtables, and
even  when  optimally  combined  the  decision  about  the  association  of  rein¬ 
forcement  with  Yellow  Volkswagens  cannot  be  made  as  reliably  as  if  they 
were  represented  directly. 

Sparse  Coding  Helps 

How  serious  is  this  factor?  Note  first  that  the  problem  arises  when  the 
same  representational  element  is  active  in  more  than  one  of  the  features 
that  may  be  reinforced.  In  the  current  example,  the  Y  unit  is  active  for  all 
yellow  things,  not  just  yellow  Volkswagens,  and  similarly  for  the  V  unit. 
The  extent  of  this  overlap  depends  on  the  sparseness  of  the  representation, 
which  is  defined  by  the  average  fraction  of  the  elements  that  are  active.  If 
it  is  very  sparse,  then  only  a  small  proportion  of  the  units  will  be  active  for 
any  input,  and  there  will  be  little  overlap.  Indeed,  if  it  is  sparse  enough 
there  will  be  only  a  single  unit  active  for  each  input,  and  each  will  therefore 
be  directly  represented.  On  the  other  hand  if  it  is  dense,  a  given  unit  will 
be  active  for  a  high  proportion  of  inputs,  and  the  overlap  problem  will  be 
serious. 

When  there  is  overlap,  what  matters  is  the  number  of  these  intrusive 
extra  occurrences  relative  to  the  number  of  genuine  occurrences  of  Yel¬ 
low  Volkswagens,  and  this  in  turn  depends  upon  the  probability  of  the 
joint  event  (YV)  relative  to  the  single  features  (Y  and  V).  If  Yellowness 
and  Volkswagens  are  both  common,  then  the  reinforcement  of  rare  Yellow 
Volkswagens  would  be  masked  by  the  quite  frequent  chance  reinforcement 
of  other  yellow  things  and  other  colored  Volkswagens. 
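The size of the masking effect can be gauged with a one-line calculation. The probabilities are invented for illustration and the function name is hypothetical: for each genuine occurrence of the joint event, it counts how many times a single unit is active without the joint event being present.

```python
def intrusions_per_genuine(p_single, p_joint):
    """Activations of a single-feature unit without the joint event, per joint event."""
    if not 0.0 < p_joint <= p_single:
        raise ValueError("need 0 < p_joint <= p_single")
    return (p_single - p_joint) / p_joint


# Yellowness common (p = 0.5), yellow Volkswagens rare (p = 0.05):
# many intrusive activations of the Y unit for each genuine one.
print(intrusions_per_genuine(0.5, 0.05))

# Yellowness rare (p = 0.02) and half of yellow things are Volkswagens:
# only one intrusive activation per genuine one.
print(intrusions_per_genuine(0.02, 0.01))
```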

One can show that the efficiency for detecting a feature X depends to
a good approximation upon the value of a parameter $T_X$ that is equal to
$a_X p_X Z / \langle a \rangle$, where $a_X$ is the fraction of representative elements in the subset
active for the representation of the feature X, $\langle a \rangle$ the average fraction active
for all inputs, $p_X$ is the probability of the feature X, and Z is the number
of representative elements in the subset. Figure 1.5 shows how efficiency
varies with the value of this parameter.

Improbable  Features  Need  Denser  Representation 

Note first that efficiency increases with T, and, as expected, T increases
with the number of neurons in the subset and the average sparseness (low
$\langle a \rangle$). It also increases both with the probability $p_X$ of the feature X, and
with the activity ratio $a_X$ for the feature X; in fact these two are reciprocally
related, so a very rare feature X can still form associations efficiently if
it causes an unusually large proportion of the units in a subset of the
representation to be active. Clearly there is scope here for genetic factors
to improve selectively the performance of a learning network: factors of
biological importance should cause many units in a learning network to
become active. Another way of putting this is to say that many of the
directly represented features should correspond to features possessed by
biologically important objects; then, when one of these objects appears, it
will cause a high level of activity.

Figure 1.5 The curve shows how statistical efficiency for detecting associations with a feature
X varies with the value of a parameter defined as follows: $T = a_X p_X Z / \langle a \rangle$, where $a_X$ and $\langle a \rangle$ are
the activity ratio for feature X and the average activity ratio, $p_X$ is the probability of X, and Z
is the number of neurons in the subset under consideration. For instance, one could identify
an association with any one of the 45 possible pairs of active neurons in a subset of 10 with an
efficiency of 50% provided that the neurons were active independently, the pair caused two
neurons to be active, the probability of the pair occurring was 0.1, and the average fraction
active was 0.2. (From Gardner-Medwin and Barlow 1994)

Next look at the actual efficiencies attainable for various values of T.
Although one needs T to be 10 or 100 for efficiencies in the 90% to 100%
range, useful efficiencies of about 50% are obtained with $T \approx 1$; this is the
order of magnitude of the efficiency of human subjects detecting bilateral
symmetry  in  dot  patterns  such  as  those  shown  in  figure  1.2  (Barlow  and 
Reeves  1979).  Consider  a  subset  of  10  elements  in  a  network;  if  one  could 
specify  10  mutually  exclusive  features,  the  elements  of  the  subset  could 
each  handle  one  of  them  and  associations  with  them  could  be  formed  with 
100%  efficiency.  Now  suppose  that  the  features  of  interest  do  not  all  cause 
firing  of  only  single  elements  among  the  ten.  If  a  particular  feature  X 
does cause firing of just one element ($a_X = 0.1$) but this element is also
active in conjunction with other elements when other features occur, then
if $p_X = \langle a \rangle = 0.1$ we will have T = 1 and the efficiency for detecting associations
will have dropped to around 50%. This reduction occurs because
intrusive  or  accidental  reinforcement  occurs  in  conjunction  with  the  activ¬ 
ity  of  any  given  element,  but  this  is  a  small  price  to  pay  for  the  increased 
versatility  resulting  from  the  possibility  of  using  and  learning  associations 
of  combinations  of  the  features,  as  illustrated  below. 

Suppose the feature X is represented by the conjunction of two elements
($a_X = 0.2$). If again $p_X = 0.1$ but we suppose $\langle a \rangle$ is now 0.2, the same as
the new value of $a_X$, then T still has the value 1, corresponding to the same
efficiency, ca. 50%. There are 45 such conjunctions of pairs of elements
among 10 elements, so a much wider range of features can be used to form
associations efficiently, and there is not an enormous loss of efficiency compared
with the direct representation of features on single elements. Notice
that  the  above  applies  to  features  represented  by  pairs  of  active  units,  but 
a  particular  merit  of  such  a  system  is  that  it  can  form  associations  with 
patterns  containing  three  or  more  active  elements.  Even  if  such  multiple 
conjunctions  of  directly  represented  features  are  rare,  provided  that  they 
cause  activity  in  a  high  proportion  of  elements,  they  will  be  learned  with 
reasonable  efficiency. 
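The two worked cases above can be checked numerically. The sketch below only evaluates the parameter T from its definition; it does not reproduce the efficiency curve of figure 1.5, whose functional form is not given here. The function and argument names are hypothetical.

```python
import math


def efficiency_parameter(a_x, p_x, z, mean_activity):
    """T = a_X * p_X * Z / <a>, as defined in the text and in figure 1.5."""
    return a_x * p_x * z / mean_activity


# Feature carried by a single element of a 10-element subset.
single = efficiency_parameter(a_x=0.1, p_x=0.1, z=10, mean_activity=0.1)

# Feature carried by a conjunction of two elements of the same subset.
pair = efficiency_parameter(a_x=0.2, p_x=0.1, z=10, mean_activity=0.2)

# Number of distinct pairs available among 10 elements.
n_pairs = math.comb(10, 2)

print(single, pair, n_pairs)
```

Both cases give T close to 1, the value associated in the text with an efficiency of about 50%, while the number of features that can be handled rises from 10 single elements to 45 pairs.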

What  this  shows  is  that  it  is  possible  to  learn  about  explicit  conjunctions 
of  any  number  of  elements  in  known  subsets  of  a  representation,  provided 
that  the  representation  is  sparse,  provided  that  these  conjunctions  do  not 
occur  too  infrequently  and  activate  a  substantial  proportion  of  elements 
when  they  do,  and  provided  that  the  representative  elements  can  be  con¬ 
sidered,  a  priori,  to  occur  independently.  How  to  achieve  this,  and  read 
out  the  results  in  a  useful  way,  cannot  be  gone  into  here. 

The  analysis  we  have  done  so  far  is  only  a  beginning.  What  can  be  done 
using  the  union  rather  than  conjunction  of  representational  elements  in  a 
subset?  What  can  be  done  with  threshold  logic  functions  on  the  activity  of 
members  of  a  subset?  We  do  not  know  the  answer  to  these  questions,  but 
one  point  does  seem  evident. 

We  have  already  seen  that  the  features  that  are  directly  represented 
should (1) occur as nearly independently of each other as possible in
the environment to which the representative system is adapted; and (2) occur
at a frequency such that the representation is neither too dense
nor too sparse. These requirements might not be too difficult to meet if one
could  postulate  an  indefinite  number  of  directly  represented  features,  but 
such  an  indefinitely  large  representation  would  have  none  of  the  capac¬ 
ity  to  generalize  sensibly  that  is  needed  in  a  representation  to  be  used  for 
learning.  This  introduces  another  requirement  for  the  selection  of  directly 
represented  features:  they  must  each  represent  as  much  as  possible  of  the 
incoming  stream  of  data  from  the  environment  and  must  occur  frequently 
so  that  generalization  occurs  usefully.  Some  tests  of  this  prediction  on 
digitized  images  of  natural  scenes  will  now  be  described. 

SELECTION  OF  DIRECTLY  REPRESENTED  FEATURES 

It  is  generally  agreed  that  the  neurons  of  the  primary  visual  cortex  respond 
selectively  to  the  borders  and  edges  of  objects  in  the  visual  image.  There  is 
argument  about  whether  they  should  be  regarded  as  edge-detectors,  Gabor 
filters,  or  wavelet  functions,  but  there  is  no  disagreement  that  they  do  in 
fact  respond  to  the  oriented  patterns  of  light  that  occur  at  the  borders  of 
objects.  If  the  arguments  (Barlow  1985)  about  the  importance  of  nonchance 
associations  are  correct,  then  measurements  of  the  distribution  of  light 
at  the  borders  of  objects  should  show  that  edges  qualify  as  "suspicious 
coincidences."  We  set  out  to  test  whether  this  was  so,  and  the  main  result 
confirms  that  it  is  (Barlow  and  Tolhurst  1992).

We  took  a  selection  of  digitized  images  and  removed  the  correlations
between  pairs  of  points,  averaged  over  the  whole  image,  by  the  "whiten¬ 
ing"  process  described  before  and  shown  in  figure  1.4;  this  leaves  behind 
the  image  structures  we  are  now  interested  in  that  occur  at  the  borders  of 
objects.  The  distribution  of  pixel  values  in  such  whitened  images  gives 
us  the  basis  for  the  chance  expectation  of  combinations  of  pixel  values, 
and  what  the  hypothesis  says  is  that  at  the  borders  of  objects  we  shall  find 
combinations  of  pixel  values  that  occur  more  frequently  than  this  chance 
expectation. 
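
As a rough illustration of what removing pairwise correlations does, the sketch below flattens the amplitude spectrum of an image with a Fourier transform while keeping its phase. This is an assumption made for illustration only: the whitening procedure behind figure 1.4 is described elsewhere in the chapter and may differ in detail. The names `whiten` and `neighbour_correlation`, the 64 x 64 synthetic test image, and the wrap-around treatment of the image borders are all hypothetical choices.

```python
import numpy as np

def whiten(image, eps=1e-12):
    """Flatten the amplitude spectrum while keeping the phase (illustrative whitening)."""
    spectrum = np.fft.fft2(image - image.mean())
    flat = spectrum / (np.abs(spectrum) + eps)  # unit amplitude at every nonzero frequency
    return np.real(np.fft.ifft2(flat))

def neighbour_correlation(image):
    """Correlation between each pixel and its right-hand neighbour (borders wrap around)."""
    shifted = np.roll(image, -1, axis=1)
    return np.corrcoef(image.ravel(), shifted.ravel())[0, 1]

rng = np.random.default_rng(0)
# Hypothetical stand-in for a natural image: doubly integrated noise, so that
# neighbouring pixels are strongly correlated, as they are in real scenes.
smooth = np.cumsum(np.cumsum(rng.standard_normal((64, 64)), axis=0), axis=1)
white = whiten(smooth)

print("neighbour correlation before:", neighbour_correlation(smooth))
print("neighbour correlation after: ", neighbour_correlation(white))
```

After this kind of whitening the average correlation between neighbouring pixels is close to zero, so any remaining regularities, such as rows of similar values along an edge, are structure that pairwise statistics alone do not predict.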

Perhaps  it  is  already  obvious  by  inspection  of  the  whitened  figure  that 
this  is  the  case,  for  you  would  not  expect  to  find  by  chance  the  rows  of 
high  or  low  values  you  can  see  in  figure  1.4.  To  confirm  this  we  measured 
the  distribution  of  the  sum  of  nine  pixel  values  selected  at  random  from 
all  over  the  whitened  image  to  provide  the  chance  distribution,  and  from 
nine  adjacent  spots  in  a  row  to  show  what  actually  occurs.  Figure  1.6 
shows  the  result:  the  distributions  are  strikingly  different.  For  the  sum 
of  nine  randomly  selected  pixels  the  range  is  from  about  980  to  1320  on 
the  horizontal  scale,  but  as  you  can  see  values  outside  this  range  are  very 
common  for  the  sum  of  nine  pixels  in  a  row. 

Do  these  extreme  values  occur  at  the  borders?  Yes  they  do,  as  shown  in 
figure  1.7,  which  marks  the  positions  in  the  image  where  these  extremes 
occur.  As  you  see  they  occur  at  edges. 

Is  this  a  consistent  feature  of  the  sum  of  pixels  in  a  line?  To  answer  this  we 
looked  at  a  varied  selection  of  15  images,  and  estimated  the  kurtosis  excess 
for  the  sums  in  a  line  compared  with  randomly  selected  pixels  and  sums 
over  square  regions.  This  measure  (Weatherburn  1961)  is  based  on  the
fourth  moment  and  values  greater  than  0  can  be  crudely  taken  to  indicate 
that  the  distribution  has  an  excess  of  extreme  values  compared  with  a 
gaussian.  As  shown  in  table  1.2,  the  kurtosis  excess  is  much  greater  for  the 
line  sum  than  for  the  other  distributions,  though  it  has  to  be  said  that  we 
do  not  understand  why  patches,  and  even  single  pixels,  also  show  kurtosis 
excess.  The  large  excess  for  lines,  combined  with  the  fact  that  the  extreme 
values  occur  at  the  borders,  vindicates  the  hypothesis  that  the  features  we 
use  to  represent  an  image  are  suspicious  coincidences — at  least  in  the  case 
of  the  orientationally  selective  units  of  V1.
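
The comparison between line sums and sums of randomly selected pixels can be sketched in a few lines. The example below is a toy under stated assumptions, not the analysis reported in table 1.2: the image is synthetic gaussian noise with a few horizontal segments of raised or lowered values standing in for edges, only horizontal lines are sampled (the measurements above used four orientations), and every name and parameter (`kurtosis_excess`, `synthetic_edge_image`, the segment count, length, and amplitude) is hypothetical.

```python
import numpy as np

def kurtosis_excess(values):
    """Fourth moment about the mean divided by the squared variance, minus 3."""
    v = np.asarray(values, dtype=float)
    d = v - v.mean()
    m2 = np.mean(d ** 2)
    m4 = np.mean(d ** 4)
    return m4 / m2 ** 2 - 3.0

def synthetic_edge_image(rng, size=128, n_segments=40, length=30, amplitude=3.0):
    """Gaussian noise plus horizontal runs of consistently high or low values."""
    image = rng.standard_normal((size, size))
    for _ in range(n_segments):
        row = rng.integers(0, size)
        col = rng.integers(0, size - length)
        sign = rng.choice([-1.0, 1.0])
        image[row, col:col + length] += sign * amplitude
    return image

def line_sums(image, n=9):
    """Sums of n horizontally adjacent pixels at every position in the image."""
    padded = np.pad(image, ((0, 0), (1, 0)))   # leading zero column for the running sum
    running = np.cumsum(padded, axis=1)
    return (running[:, n:] - running[:, :-n]).ravel()

def random_sums(image, rng, n=9, n_samples=20000):
    """Sums of n pixels drawn from random positions all over the image."""
    flat = image.ravel()
    index = rng.integers(0, flat.size, size=(n_samples, n))
    return flat[index].sum(axis=1)

rng = np.random.default_rng(0)
image = synthetic_edge_image(rng)
excess_line = kurtosis_excess(line_sums(image))
excess_random = kurtosis_excess(random_sums(image, rng))
excess_pixels = kurtosis_excess(image)

print("line sums:    ", excess_line)
print("random sums:  ", excess_random)
print("single pixels:", excess_pixels)
```

In this toy image the line sums show a much larger kurtosis excess than the random sums, because windows that fall on a segment add nine values of the same sign. The absolute numbers depend entirely on the made-up parameters and should not be compared with table 1.2.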

SUMMARY  AND  CONCLUSIONS 

It  was  suggested  initially  that  we  dominate  the  world  because  we  know 
more  about  it  than  other  animals,  and  that  it  is  the  neocortex  that  is  re¬ 
sponsible  for  this.  How  to  acquire  and  store  knowledge  of  the  world  is  a 
vast  problem,  but  although  we  have  only  scratched  the  surface  we  may  be 
beginning  to  discover  how  the  neocortex  could,  as  the  combined  result  of 
genetic  selection  and  individual  experience,  provide  us  with  a  represen¬ 
tation  of  the  current  scene  that  automatically  stores,  gives  access  to,  and 
adds  to  such  knowledge.

Figure  1.6  Distributions  of  the  sum  of  nine  randomly  selected  pixels  (top),  and  nine  pixels
in  a  line  at  four  orientations  (bottom)  from  the  right  hand  (whitened)  image  of  figure  1.4.  The
lower  distribution  has  an  excess  of  extreme  values — that  is,  values  unexpected  on  the  basis  of
the  distribution  of  individual  pixel  values  in  the  whitened  image.  (Tolhurst  and  Barlow  1994)

1.  The  first  principle,  suggested  diagrammatically  in  figure  1.3,  is  that 
neocortex  removes  associative  structure  that  has  already  been  identified 
through  past  experience.  This  is  analogous  to  discounting  the  mean  lumi¬ 
nance  in  light  adaptation,  or  removing  a  known  regression  when  trying  to 
make  sense  of  residual  deviations.  Identified  structure  would  be  stored, 
and  when  recognized  in  the  current  scene  it  would  form  part  of  a  matching 
model;  the  unmatched  residue  would  contain  new  information  about  that 
scene.  Stored  knowledge  of  the  associative  structure  of  the  world  would 
be  used  continuously  and  automatically  in  this  way,  but  there  could  be 
other  methods  of  accessing  it  for  purposes  of  imagery  and  recall.

2.  To  prove  that  an  association  between  two  features  exists  you  need  to
know  their  individual  frequencies  of  occurrence,  because  you  must  estimate
the  chance  frequency  of  joint  occurrence  to  show  that  the  actual
frequency  is  significantly  greater.

3.  To  detect  associations  with  combinations  of  features,  the  features  should
be  chosen  so  that  they  occur  as  nearly  as  possible  independently  of  each
other  in  the  environment  to  which  the  system  is  adapted,  for  otherwise
the  expected  frequency  of  occurrence  of  a  combination  is  hard  to  determine.

4.  In  distributed  representations,  detecting  associations  with  conjunctions
of  features  is  difficult  because  accidental  associations  with  the  constituents
of  the  conjunctions  mask  associations  with  the  conjunctions  themselves.
This  problem  tends  to  make  the  identification  of  associations  with  conjunctions
inefficient  in  dense  distributed  representations,  and  such  representations
are  therefore  unlikely  to  be  useful.

Table  1.2  Kurtosis  excess  for  15  images

Distribution                     Kurtosis excess (mean ± SE)
Sum of pixels over line 9x1      15.82 ± 3.66
Sum of 9 random pixels           0.989 ± 0.76
Sum over square 3x3              7.30 ± 1.68
Single pixels                    8.81 ± 1.70

Figure  1.7  The  left  panel  is  the  whitened  image  from  figure  1.4.  In  the  right  panel  white
dots  are  placed  at  the  positions  of  the  upper  extreme  values  of  the  distributions  of  the  sum
of  pixels  in  a  line  shown  in  figure  1.6,  and  dark  dots  at  lower  extreme  values.  The  extreme
values  obviously  occur  at  the  borders  of  objects  in  the  image.  Hence  the  combinations  of
pixel  values  that  occur  at  edges  are  ones  that  would  not  be  expected  by  chance.

5.  On  the  other  hand  in  sparse  distributed  representations  associations  can 
be  identified  reasonably  efficiently  (say  50%)  with  features  represented  by
the  conjunction  of  directly  represented  features,  provided  that  these  con¬ 
junctions  are  not  too  sparsely  represented  and  occur  with  a  frequency  not 
too  far  below  that  of  the  directly  represented  features. 

6.  To  generate  a  reasonably  economical  representation  of  the  current  scene 
the  directly  represented  features  should  be  suspicious  coincidences — com¬ 
binations  of  signals  from  lower  levels  that  occur  frequently  but  would  be 
rarely  expected  by  chance.  The  representational  elements  at  higher  levels 
should  be  matched  to  the  biological  importance  and  statistical  structure  of 
occurrences  at  lower  levels. 

7.  This  notion  that  the  representational  elements  used  by  the  brain  corre¬ 
spond  to  suspicious  coincidences,  or  combinations  of  simpler  events  that 
occur  more  often  than  expected  by  chance,  has  received  some  support  from 
the  statistical  analysis  of  edges  in  images. 

8.  The  features  that  are  directly  represented  at  any  level  in  the  hierarchy 
will  have  a  strong  effect  on  the  performance  of  a  representational  network, 
including  the  way  that  generalization  occurs.  The  selection  of  these  fea¬ 
tures  is  likely  to  be  one  way  that  genetic  factors  exert  their  influence.  In 
addition,  ontogenetic  control  of  the  connections  between  levels  probably 
determines  the  way  that  information  of  different  types  is  segregated  and 
brought  together  according  to  nontopographic  principles.  In  these  two 
ways,  and  possibly  others,  genetic  factors  must  influence  how  the  cortex 
handles  sensory  information,  and  they  can  be  regarded  as  an  inherited 
store  of  world  knowledge;  the  genetic  variability  that  has  enabled  such  a 
store  to  be  formed  may  be  at  least  as  important  as  the  ability  of  the  cortex 
to  acquire  world  knowledge  by  its  own  direct  experience. 
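
Points 2 and 3 of this summary can be made concrete with a small sketch. If two features occurred independently, their chance frequency of joint occurrence would be the product of their individual frequencies; a suspicious coincidence is a joint frequency well above that product. The function name `coincidence_ratio` and the two toy records of 20 observations are hypothetical, and the sketch deliberately stops short of a significance test, which a real analysis would also need.

```python
def coincidence_ratio(feature_a, feature_b):
    """Observed joint frequency divided by the chance expectation p(A) * p(B)."""
    if len(feature_a) != len(feature_b) or not feature_a:
        raise ValueError("need two equally long, non-empty records")
    n = len(feature_a)
    p_a = sum(feature_a) / n
    p_b = sum(feature_b) / n
    p_joint = sum(1 for a, b in zip(feature_a, feature_b) if a and b) / n
    expected = p_a * p_b  # joint frequency if the two features were independent
    if expected == 0:
        raise ValueError("a feature that never occurs has no chance expectation")
    return p_joint / expected

# Toy record 1: each feature is present half the time and they co-occur
# exactly as often as chance predicts.
independent_a = [1, 1, 0, 0] * 5
independent_b = [1, 0, 1, 0] * 5

# Toy record 2: each feature is present 5 times in 20, but 4 of those
# occurrences coincide, far more than the chance expectation.
associated_a = [1] * 5 + [0] * 15
associated_b = [1] * 4 + [0, 1] + [0] * 14

ratio_independent = coincidence_ratio(independent_a, independent_b)
ratio_associated = coincidence_ratio(associated_a, associated_b)
print("independent features:", ratio_independent)
print("associated features: ", ratio_associated)
```

A ratio near 1 means the joint occurrences are what chance would produce; a ratio well above 1 marks a candidate suspicious coincidence. The calculation only works because the individual frequencies are known, which is the point of item 2.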

A  Critique  of  Pure  Vision1

Patricia  S.  Churchland,  V.  S.  Ramachandran,  and 
Terrence  J.  Sejnowski 


INTRODUCTION 

Any  domain  of  scientific  research  has  its  sustaining  orthodoxy.  That  is, 
research  on  a  problem,  whether  in  astronomy,  physics,  or  biology,  is  con¬ 
ducted  against  a  backdrop  of  broadly  shared  assumptions.  It  is  these  as¬ 
sumptions  that  guide  inquiry  and  provide  the  canon  of  what  is  reasonable — 
of  what  "makes  sense."  And  it  is  these  shared  assumptions  that  constitute 
a  framework  for  the  interpretation  of  research  results.  Research  on  the 
problem  of  how  we  see  is  likewise  sustained  by  broadly  shared  assump¬ 
tions,  where  the  current  orthodoxy  embraces  the  very  general  idea  that  the 
business  of  the  visual  system  is  to  create  a  detailed  replica  of  the  visual 
world,  and  that  it  accomplishes  its  business  via  hierarchical  organization 
and  by  operating  essentially  independently  of  other  sensory  modalities  as 
well  as  independently  of  previous  learning,  goals,  motor  planning,  and 
motor  execution. 

We  shall  begin  by  briefly  presenting,  in  its  most  extreme  version,  the 
conventional  wisdom.  For  convenience,  we  shall  refer  to  this  wisdom 
as  the  Theory  of  Pure  Vision.  We  then  outline  an  alternative  approach, 
which,  having  lurked  on  the  scientific  fringes  as  a  theoretical  possibility,  is 
now  acquiring  robust  experimental  infrastructure  (see,  e.g.,  Adrian  1935; 
Sperry  1952;  Bartlett  1958;  Spark  and  Jay  1986;  Arbib  1989).  Our  characterization
of  this  alternative,  to  wit,  interactive  vision,  is  avowedly  sketchy
and  inadequate.  Part  of  the  inadequacy  is  owed  to  the  nonexistence  of  an 
appropriate  vocabulary  to  express  what  might  be  involved  in  interactive 
vision.  Having  posted  that  caveat,  we  suggest  that  systems  ostensibly  "ex¬ 
trinsic"  to  literally  seeing  the  world,  such  as  the  motor  system  and  other 
sensory  systems,  do  in  fact  play  a  significant  role  in  what  is  literally  seen. 
The  idea  of  "pure  vision"  is  a  fiction,  we  suggest,  that  obscures  some  of 
the  most  important  computational  strategies  used  by  the  brain.  Unlike 
some  idealizations,  such  as  "frictionless  plane"  or  "perfect  elasticity"  that 
can  be  useful  in  achieving  a  core  explanation,  "pure  vision"  is  a  notion 
that  impedes  progress,  rather  like  the  notion  of  "absolute  downness"  or 
"indivisible  atom."  Taken  individually,  our  criticisms  of  "pure  vision"  are 
neither  new  nor  convincing;  taken  collectively  in  a  computational  context, 
they  make  a  rather  forceful  case. 


These  criticisms  notwithstanding,  the  Theory  of  Pure  Vision  together 
with  the  Doctrine  of  the  Receptive  Field  have  been  enormously  fruitful  in 
fostering  research  on  functional  issues.  They  have  enabled  many  programs 
of  neurobiological  research  to  flourish,  and  they  have  been  crucial  in  getting 
us  to  where  we  are.  Our  questions,  however,  are  not  about  past  utility,  but 
about  future  progress.  Has  research  in  vision  now  reached  a  stage  where 
the  orthodoxy  no  longer  works  to  promote  groundbreaking  discovery? 
Does  the  orthodoxy  impede  really  fresh  discovery  by  cleaving  to  outdated 
assumptions?  What  would  a  different  paradigm  look  like?  This  chapter  is 
an  exploration  of  these  questions. 

PURE  VISION:  A  CARICATURE 

This  brief  caricature  occupies  one  corner  of  an  hypothesis-space  concerning 
the  computational  organization  and  dynamics  of  mammalian  vision.  The 
core  tenets  are  logically  independent  of  one  another,  although  they  are  often 
believed  as  a  batch.  Most  vision  researchers  would  wish  to  amend  and 
qualify  one  or  another  of  the  core  tenets,  especially  in  view  of  anatomical 
descriptions  of  backprojections  between  higher  and  lower  visual  areas. 
Nevertheless,  the  general  picture,  plus  or  minus  a  bit,  appears  to  be  rather 
widely  accepted — at  least  as  being  correct  in  its  essentials  and  needing 
at  most  a  bit  of  fine  tuning.  The  approach  outlined  by  the  late  David 
Marr  (1982)  resembles  the  caricature  rather  closely,  and  as  Marr  has  been  a 
fountainhead  for  computer  vision  research,  conforming  to  the  three  tenets 
has  been  the  starting  point  for  many  computer  vision  projects.2

1.  The  Visual  World.  What  we  see  at  any  given  moment  is  in  general  a  fully
elaborated  representation  of  a  visual  scene.  The  goal  of  vision  is  to  create  a 
detailed  model  of  the  world  in  front  of  the  eyes  in  the  brain.  Thus  Tsotsos 
(1987)  says,  "The  goal  of  an  image-understanding  system  is  to  transform 
two-dimensional  data  into  a  description  of  the  three-dimensional  spatio- 
temporal  world"  (p.  389).  In  their  review  paper,  Aloimonos  and  Rosenfeld 
(1991)  note  this  characterization  with  approval,  adding,  "Regarding  the 
central  goal  of  vision  as  scene  recovery  makes  sense.  If  we  are  able  to 
create,  using  vision,  an  accurate  representation  of  the  three-dimensional 
world  and  its  properties,  then  using  this  information  we  can  perform  any 
visual  task"  (p.  1250). 

2.  Hierarchical  Processing.  Signal  elaboration  proceeds  from  the  vari¬ 
ous  retinal  stages,  to  the  LGN,  and  thence  to  higher  and  higher  cortical 
processing  stages.  At  successive  stages,  the  basic  processing  achievement 
consists  in  the  extraction  of  increasingly  specific  features  and  eventually 
the  integration  of  various  highly  specified  features,  until  the  visual  system 
has  a  fully  elaborated  representation  that  corresponds  to  the  visual  scene 
that  initially  caused  the  retinal  response.  Pattern  recognition  occurs  at  that 
stage.  Visual  learning  occurs  at  later  rather  than  earlier  stages.

3.  Dependency  Relations.  Higher  levels  in  the  processing  hierarchy  depend
on  lower  levels,  but  not,  in  general,  vice  versa.  Some  problems  are
early  (low  level)  problems;  for  example,  early  vision  involves  determining
what  is  an  edge,  what  correspondences  between  right  and  left  images  are
suitable  for  stereo,  what  principal  curvatures  are  implied  by  shading  profiles,
and  where  there  is  movement  (Yuille  and  Ullman  1990).  Early  vision
does  not  require  or  depend  on  a  solution  to  the  problems  of  segmentation 
or  pattern  recognition  or  gestalt.3 

Note  finally  that  the  caricature,  and,  most  especially,  the  "visual  world"
assumption  of  the  caricature,  gets  compelling  endorsement  from  common 
sense.  From  the  vantage  point  of  how  things  seem  to  be,  there  is  no  deny¬ 
ing  that  at  any  given  moment  we  seem  to  see  the  detailed  array  of  whatever 
visible  features  of  the  world  are  in  front  of  our  eyes.  Apparently,  the  world 
is  there  to  be  seen,  and  our  brains  do  represent,  in  essentially  all  its  glory, 
what  is  there  to  be  seen.  Within  neuroscience,  a  great  deal  of  physiolog¬ 
ical,  lesion,  and  anatomical  data  are  reasonably  interpretable  as  evidence 
for  some  kind  of  hierarchical  organization  (Van  Essen  and  Anderson  1990). 
Hierarchical  processing,  moreover,  surely  seems  an  eminently  sensible  en¬ 
gineering  strategy — a  strategy  so  obvious  as  hardly  to  merit  ponderous 
reflection.  Thus,  despite  our  modification  of  all  tenets  of  the  caricature,  we 
readily  acknowledge  their  prima  facie  reasonableness  and  their  appeal  to 
common  sense. 

INTERACTIVE  VISION:  A  PROSPECTUS 

What  is  vision  for?  Is  a  perfect  internal  recreation  of  the  three-dimensional 
world  really  necessary?  Biological  and  computational  answers  to  these 
questions  lead  to  a  conception  of  vision  quite  different  from  pure  vision. 
Interactive  vision,  as  outlined  here,  includes  vision  with  other  sensory 
systems  as  partners  in  helping  to  guide  actions. 

1.  Evolution  of  Perceptual  Systems.  Vision,  like  other  sensory  functions, 
has  its  evolutionary  rationale  rooted  in  improved  motor  control.  Although 
organisms  can  of  course  see  when  motionless  or  paralyzed,  the  visual  sys¬ 
tem  of  the  brain  has  the  organization,  computational  profile,  and  archi¬ 
tecture  it  has  in  order  to  facilitate  the  organism's  thriving  at  the  four  Fs: 
feeding,  fleeing,  fighting,  and  reproduction.  By  contrast,  a  pure  visionary
would  say  that  the  visual  system  creates  a  fully  elaborated  model  of  the 
world  in  the  brain,  and  that  the  visual  system  can  be  studied  and  modeled 
without  worrying  too  much  about  the  nonvisual  influences  on  vision. 

2.  Visual  Semiworlds.  What  we  see  at  any  given  moment  is  a  partially
elaborated  representation  of  the  visual  scene;  only  immediately  relevant 
information  is  explicitly  represented.  The  eyes  saccade  every  200  or  300 
msec,  scanning  an  area.  How  much  of  the  visual  field,  and  within  that,  how 
much  of  the  foveated  area,  is  represented  in  detail  depends  on  many  fac¬ 
tors,  including  the  animal's  interests  (food,  a  mate,  novelty,  etc.),  its  long- 
and  short-term  goals,  whether  the  stimulus  is  refoveated,  whether  the  stim¬ 
ulus  is  simple  or  complex,  familiar  or  unfamiliar,  expected  or  unexpected, 
and  so  on.  Although  unattended  objects  may  be  represented  in  some  minimal
fashion  (sufficient  to  guide  attentional  shifts  and  eye  movements,  for
example)  they  are  not  literally  seen  in  the  sense  of  "visually  experienced."

Figure  2.1  The  scan  path  of  saccadic  eye  movements  made  by  a  subject  viewing  the  picture.
(Reprinted  with  permission  from  Yarbus  1967.)

3.  Interactive  Vision  and  Predictive  Visual  Learning.  Interactive  vision  is
exploratory  and  predictive.  Visual  learning  allows  an  animal  to  predict 
what  will  happen  in  the  future;  behavior,  such  as  eye  movements,  aids 
in  updating  and  upgrading  the  predictive  representations.  Correlations 
between  the  modalities  also  improve  predictive  representations,  especially 
in  the  murk  and  ambiguity  of  real-world  conditions.  Seeing  an  uncommon 
stimulus  at  dusk  such  as  a  skunk  in  the  bushes  takes  more  time  than  seeing 
a  common  animal  such  as  a  dog  in  full  light  and  in  full,  canonical  view. 
The  recognition  can  be  faster  and  more  accurate  if  the  animal  can  make 
exploratory  movements,  particularly  of  its  perceptual  apparatus,  such  as 
whiskers,  ears,  and  eyes.  There  is  some  sort  of  integration  across  time  as 
the  eyes  travel  and  retravel  a  scan  path  (figure  2.1),  foveating  again  and 
again  the  significant  and  salient  features.  One  result  of  this  integration 
is  the  strong  but  false  introspective  impression  that  at  any  given  moment 
one  sees,  crisply  and  with  good  definition,  the  whole  scene  in  front  of  one. 
Repeated  exposure  to  a  scene  segment  is  connected  to  greater  elaboration 
of  the  signals  as  revealed  by  more  and  more  specific  pattern  recognition 
[e.g.,  (1)  an  animal,  (2)  a  bear,  (3)  a  grizzly  bear  with  cubs,  (4)  the  mother
bear  has  not  yet  seen  us]. 

4.  Motor  System  and  Visual  System.  A  pure  visionary  typically  assumes 
that  the  connection  to  the  motor  system  is  made  only  after  the  scene  is 
fully  elaborated.  His  idea  is  that  the  decision  centers  make  a  decision
about  what  to  do  on  the  basis  of  the  best  and  most  complete  representation
of  the  external  world.  An  interactive  visionary,  by  contrast,  will  suggest 
that  motor  assembling  begins  on  the  basis  of  preliminary  and  minimal 
analysis.  Some  motor  decisions,  such  as  eye  movements,  head  movements, 
and  keeping  the  rest  of  the  body  motionless,  are  often  made  on  the  basis  of 
minimal  analysis  precisely  in  order  to  achieve  an  upgraded  and  more  fully 
elaborated  visuomotor  representation.  Keeping  the  body  motionless  is  not 
doing  nothing,  and  may  be  essential  to  getting  a  good  view  of  shy  prey.  A 
very  simple  reflex  behavior  (e.g.,  nociceptive  reflex)  may  be  effected  using 
rather  minimal  analysis,  but  planning  a  complex  motor  act,  such  as  stalking 
a  prey,  may  require  much  more.  In  particular,  complex  acts  may  require  an 
antecedent  "inventorying"  of  sensorimotor  predictions:  what  will  happen 
if  I  do  a,  b,  and  g;  how  should  I  move  if  the  X  does  p,  and  so  forth. 

In  computer  science,  pioneering  work  exploring  the  computational  re¬ 
sources  of  a  system  whose  limb  and  sensor  movements  affect  the  process¬ 
ing  of  visual  inputs  is  well  underway,  principally  in  research  by  R.  Bajcsy 
(1988),  Dana  Ballard  (Ballard  1991;  Ballard  et  al.  1992;  Ballard  and  Whitehead,
1991;  Whitehead  and  Ballard,  1991),  Randall  Beer  (1990),  and  Rodney
Brooks  (1989).  Other  modelers  have  also  been  alerted  to  potential  computational
economies,  and  a  more  integrative  approach  to  computer  vision  is
the  focus  of  a  collection  of  papers,  Active  Vision  (1993),  edited  by  Andrew
Blake  and  Alan  Yuille.

5.  Not  a  Good-Old-Fashioned  Hierarchy  Recognition.  The  recognition  (including
predictive,  what-next  recognition)  in  the  real-world  case  depends
on  richly  recurrent  networks,  some  of  which  involve  recognition  of  visuo¬ 
motor  patterns,  such  as,  roughly,  "this  critter  will  make  a  bad  smell  if  I 
chase  it,"  "that  looks  like  a  rock  but  it  sounds  like  a  rattlesnake,  which 
might  bite  me."  Consequently,  the  degree  to  which  sensory  processing 
can  usefully  be  described  as  hierarchical  is  moot.  Rich  recurrence,  es¬ 
pecially  with  continuing  multicortical  area  input  to  the  thalamus  and  to 
motor  structures,  appears  to  challenge  the  conventional  conception  of  a 
chiefly  unidirectional,  low-to-high  processing  hierarchy.  Of  course,  tem¬ 
porally  distinct  stages  between  the  time  photons  strike  the  retina  and  the 
time  the  behavior  begins  do  exist.  There  are,  as  well,  stages  in  the  sense 
of  different  synaptic  distances  from  the  sensory  periphery  and  the  motor 
periphery.  Our  aim  is  not,  therefore,  to  gainsay  stages  per  se,  but  only  to 
challenge  the  more  theoretically  emburdened  notion  of  a  strict  hierarchy. 
No  obvious  replacement  term  for  "hierarchy"  suggests  itself,  and  a  new 
set  of  concepts  adequate  to  describing  interactive  systems  is  needed.  (Ap¬ 
proaching  the  same  issues,  but  from  the  perspective  of  neuropsychology, 
Antonio  Damasio  also  explores  related  ideas  [see  Damasio  1989b,d]).

6.  Memory  and  Vision.  Rich  recurrence  in  network  processing  also  means 
that  stored  information  from  earlier  learning  plays  a  role  in  what  the  animal 
literally  sees.  A  previous  encounter  with  a  porcupine  makes  a  difference 
to  how  a  dog  sees  the  object  on  the  next  encounter.  A  neuroscientist  and  a 
rancher  do  not  see  the  same  thing  in  figure  2.2.  The  neuroscientist  cannot
help  but  see  it  as  a  neuron;  the  rancher  wonders  if  it  might  be  a  kind  of
insect.  A  sheep  rancher  looking  over  his  flock  recognizes  patterns,  such  as
a  ewe  with  lambing  troubles,  to  which  the  neuroscientist  is  utterly  blind.
The  latency  for  fusing  a  Julesz  random-dot  stereogram  is  much  shorter
with  practice,  even  on  the  second  try.  Some  learning  probably  takes  place
even  in  very  early  stages.

Figure  2.2  Stereo  pair  of  a  reconstructed  layer  five  pyramidal  neuron  from  cat  visual  cortex
(courtesy  of  Rodney  Douglas).  The  apical  dendrite  extends  through  the  upper  layers  of  the
cortex  and  has  an  extensive  arborization  in  layer  1.  This  neuron  can  be  fused  by  placing  a  sheet
of  cardboard  between  the  two  images  and  between  your  two  eyes.  Look  "through"  the  figure
to  diverge  your  eyes  sufficiently  to  bring  the  two  images  into  register.  The  basal  dendrites,
which  receive  a  majority  of  the  synapses  onto  the  cell,  fill  a  ball  in  three-dimensional  space.
Apical  dendritic  tufts  form  clusters.

7.  Pragmatics  of  Research.  In  studying  nervous  systems,  it  seems  reason¬ 
able  to  try  to  isolate  and  understand  component  systems  before  trying  to 
see  how  the  component  system  integrates  with  other  brain  functions.  Nev¬ 
ertheless,  if  the  visual  system  is  intimately  and  multifariously  integrated 
with  other  functions,  including  motor  control,  approaching  vision  from 
the  perspective  of  sensorimotor  representation  and  computations  may  be 
strategically  unavoidable.  Like  the  study  of  "pure  blood"  or  "pure  diges¬ 
tion,"  the  study  of  "pure  vision"  may  take  us  only  so  far. 

Our  perspective  is  rooted  in  neuroscience  (see  also  Jeannerod  and  Decety 
1990).  We  shall  mainly  focus  on  three  broad  questions:  (1)  Is  there  empirical 
plausibility — chiefly,  neurobiological  and  psychological  plausibility — to 
the  interactive  perception  approach?  (2)  What  clues  are  available  from  the 
nervous  system  to  tell  us  how  to  develop  the  interactive  framework  beyond 
its  nascent  stages?  and  (3)  What  computational  advantage  would  such 
an  interactive  approach  have  over  traditional  computational  approaches? 
Under  this  aegis,  we  shall  raise  issues  concerning  possible  reinterpretation 
of  existing  neurobiological  data,  and  concerning  the  implications  for  the
problem  of  learning  in  nervous  systems.  Emerging  from  this  exploration
is  a  general  direction  for  thinking  about  interactive  vision. 

IS  PERCEPTION  INTERACTIVE? 

Visual  Psychophysics 

In  the  following  subsections,  we  briefly  discuss  various  psychophysical  ex¬ 
periments  that  incline  us  to  favor  the  interactive  framework.  In  general, 
these  experiments  tend  to  show  that  whatever  stages  of  processing  are  re¬ 
ally  involved  in  vision,  the  idea  of  a  largely  straightforward  hierarchy  from 
"early  processes"  (detection  of  lines,  shape  from  shading,  stereo)  to  "later 
processes"  (pattern  recognition)  is  at  odds  with  the  data  (see  also  Rama- 
chandran  1986;  Nakayama  and  Shimojo  1992;  Zijang  and  Nakayama  1992). 

Are  There  Global  Influences  on  Local  Computation?  Subjective  Motion 
Experiments  Seeing  a  moving  object  requires  that  the  visual  system  solve 
the  problem  of  determining  which  features  of  the  earlier  presentation  go 
with  which  features  of  the  later  presentation  (also  known  as  the  Correspon¬ 
dence  Problem).  In  his  work  in  computer  vision,  Ullman  (1979)  proposed 
a  solution  to  this  problem  that  avoids  global  constraints  and  relies  only  on 
local  information.  His  algorithm  solves  the  problem  by  trying  out  all  pos¬ 
sible  matches  and  through  successive  iterations  it  finds  the  set  of  matches 
that  yields  the  minimum  total  distance.  A  computer  given  certain  corre¬ 
spondence  tasks  and  running  Ullman's  algorithm  will  perform  the  task. 
His  results  show  that  the  problem  can  be  solved  locally,  and  to  that  extent  it  is
an  important  demonstration  of  possibility.  To  understand  how  biological 
visual  systems  really  solve  the  problem,  we  need  to  discover  experimen¬ 
tally  whether  global  factors  play  a  role  in  the  system's  perceptions.  In  the 
examples  discussed  in  this  section,  "global"  refers  to  broad  regions  of  the 
visual  field  as  opposed  to  "local,"  meaning  very  small  regions  such  as  the 
receptive  fields  of  cells  in  the  parafoveal  region  of  V1  (~1°)  or  V4  (~5°).
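
The minimum-total-distance criterion can be sketched as follows. This is a brute-force illustration that enumerates every one-to-one matching; it is not a reimplementation of Ullman's iterative algorithm, and the function names and dot coordinates are hypothetical. It shows that when two dots jump to the other two corners of a square, the distance criterion is satisfied equally well by a horizontal and a vertical matching, whereas stretching the square into a rectangle leaves a single best matching.

```python
from itertools import permutations
import math

def total_distance(points_a, points_b, assignment):
    """Summed distance when dot i of the first frame is matched to dot assignment[i]."""
    return sum(math.dist(points_a[i], points_b[j]) for i, j in enumerate(assignment))

def minimal_matchings(points_a, points_b, tol=1e-9):
    """Return the minimum total distance and every one-to-one matching that achieves it."""
    costs = {
        perm: total_distance(points_a, points_b, perm)
        for perm in permutations(range(len(points_b)))
    }
    best = min(costs.values())
    return best, [perm for perm, cost in costs.items() if cost - best <= tol]

# Square display: frame A occupies one diagonal, frame B the other.
square_a = [(0.0, 0.0), (1.0, 1.0)]
square_b = [(1.0, 0.0), (0.0, 1.0)]
square_cost, square_solutions = minimal_matchings(square_a, square_b)

# Same layout stretched horizontally: vertical jumps are now shorter.
wide_a = [(0.0, 0.0), (2.0, 1.0)]
wide_b = [(2.0, 0.0), (0.0, 1.0)]
wide_cost, wide_solutions = minimal_matchings(wide_a, wide_b)

print("square:", square_cost, square_solutions)
print("wide:  ", wide_cost, wide_solutions)
```

For the square the two matchings tie, so a purely distance-based rule gives no reason to prefer one direction of motion over the other. Whatever resolves the tie in a biological visual system has to come from elsewhere, which is why the experimental question about global factors matters.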

1.  Bistable  Quartets.  The  displays  shown  in  figure  2.3  are  produced  on 
a  television  screen  in  fast  alternation — the  first  array  of  dots  (A:  coded  as 
filled),  then  the  second  array  of  dots  (B:  coded  as  open),  then  A  then  B,  as  in 
a  moving  picture.  The  brain  matches  the  two  dots  in  A  with  dots  in  B,  and 
subjects  see  the  dots  moving  from  A  position  to  B  position.  Subjects  see 
either  horizontal  movement  or  vertical  movement;  they  do  not  see  diagonal 
movement.  The  display  is  designed  to  be  ambiguous,  in  that  for  any  given 
A  dot,  there  is  both  a  horizontal  B  dot  and  also  a  vertical  B  dot,  to  which  it 
could  correspond.  Although  the  probability  is  0.5  of  seeing  any  given  A-B 
pair  oscillating  in  a  given  direction,  in  fact  observers  always  see  the  set  of 
dots  moving  as  a  group — they  all  move  vertically  or  all  move  horizontally 
(Ramachandran  and  Anstis,  1983).  Normal  observers  do  not  see  a  mixture 
of  some  horizontal  and  some  vertical  movements.  This  phenomenon  is  an 
instance of the more general class of effects known as motion capture, and

Figure 2.3 Bistable quartets. When the first array of dots (represented
by filled circles, and indicated by A in the top left quartet) alternates with
the second array of dots (represented by open circles, indicated by B in the
top left quartet), subjects see either all vertical or all horizontal
oscillations. Normal observers do not see a mixture of some horizontal and
some vertical movements, nor do they see diagonal movement. (Based on
Ramachandran and Anstis 1983)

it  strongly  suggests  that  global  considerations  are  relevant  to  the  brain's 
strategy  for  dealing  with  the  correspondence  problem.  Otherwise,  one 
would  expect  to  see,  at  least  some  of  the  time,  a  mix  of  horizontal  and 
vertical  movements. 
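
The force of this argument can be made concrete with a little arithmetic. The sketch below assumes a purely hypothetical null model, not one taken from the experiments: each quartet resolves its ambiguity independently, choosing the horizontal axis with probability 0.5. Under that assumption, the chance that every quartet in the display happens to agree falls off rapidly as quartets are added, so a percept that is always uniform is hard to attribute to independent local matching.

```python
from fractions import Fraction


def prob_all_agree(n_quartets, p_horizontal=Fraction(1, 2)):
    """Probability that n independent quartets all pick the same axis.

    Hypothetical null model (an assumption of this sketch): each quartet
    chooses horizontal with probability p_horizontal, independently of
    the others, and vertical otherwise.
    """
    p_vertical = 1 - p_horizontal
    # Either every quartet goes horizontal, or every quartet goes vertical.
    return p_horizontal ** n_quartets + p_vertical ** n_quartets


for n in (1, 2, 4, 8):
    print(n, float(prob_all_agree(n)))
```

With eight quartets, the independent model predicts a uniform percept on fewer than one trial in a hundred, whereas observers report it every time.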

2.  Behind  the  Occluder  (figure  2.4).  Suppose  both  the  A  frame  and  the 
B frame contain a shaded square on the right-hand side. Now, if all dots
in the A group blink off and only the uppermost and lowermost dots of
the B1 group blink on, subjects see all A dots move to the B1 location,
including  the  middle  A  dot,  which  is  seen  to  move  behind  the  "virtual" 
occluder.  (It  works  just  as  well  if  the  occluder  occupies  upper  or  lower 
positions.)  If,  however,  A  contains  only  one  dot  in  the  middle  position 
on  the  left  plus  the  occluding  square  on  its  right,  when  that  single  dot 
merely  blinks  off,  subjects  do  not  see  the  dot  move  behind  the  occluder. 
They  see  a  square  on  the  right  and  a  blinking  dot  on  the  left.  Because 
motion  behind  the  occluder  is  seen  in  the  context  of  surrounding  subjective 
motion  but  not  in  the  context  of  the  single  dot,  this  betokens  the  relevance 
of  surrounding  subjective  motion  to  subjective  motion  of  a  single  spot. 
Again,  this  suggests  that  the  global  properties  of  the  scene  are  important 
in  determining  whether  subjects  see  a  moving  dot  or  a  stationary  blinking 
light  (Ramachandran  and  Anstis  1986). 

3. Cross-Modal Interactions. Suppose the display consists of a single blinking
dot and a shaded square (behind which the moving dot could "hide").
As  before,  A  and  B  are  alternately  presented — first  A  (dot  plus  occluder), 
then  B  (occluder  only),  then  A,  then  B.  As  noted  above,  the  subject  sees 
no  motion  (figure  2.4  III).  Now,  however,  change  conditions  by  adding  an 
auditory stimulus presented by earphones. More exactly, the change is this:

Figure 2.4 The stimuli used to elicit the phenomenon of illusory motion
behind an occluder. When the occluder is present, subjects perceive all the dots as moving to
the right, including the middle left dot, which is seen to move to the right and behind the
square. In the absence of the occluder, the middle dot appears to move to the upper right.
When the display is changed so that only the middle dot remains while upper and lower
dots are removed, the middle dot is seen to merely blink off and on, but not to move behind
the occluder. When, however, a tone is presented in the left ear simultaneously with the dot
coming on, and in the right ear simultaneously with the dot going off, subjects do see the
single dot move behind the occluder. (Based on Ramachandran and Anstis 1986)

Simultaneous  with  the  blinking  on  of  the  light,  a  tone  is  sounded  in  the  left 
ear;  simultaneous  with  the  blinking  off,  a  tone  is  sounded  in  the  right  ear. 
With the addition of the auditory stimulus, subjects do indeed see the single
dot move to the right behind the occluder. In effect, the sound "pulls"
the dot in the direction in which the sound moves (Ramachandran, Intriligator,
and Cavanaugh, unpublished observations). In this experiment, the
cross-modal influence on what is seen is especially convincing evidence
for some form of interactive vision as opposed to a pure, straight-through,
noninteractive hierarchy. (A weak subjective motion effect can be achieved
when  the  blinking  of  the  light  is  accompanied  by  somatosensory  left-right 
vibration  stimulation  to  the  hands.  Other  variations  on  this  condition  could 
be  tried.) 

It comes as no surprise that visual and auditory information is integrated
at some stage in neural processing. After all, we see dogs barking
and drummers drumming. What is surprising in these results is that the

Figure 2.5 Two frames in an apparent motion display. The four Pacmen give rise to the
perception of an occluding square that moves from the left circles to the right circles.


auditory  stimulus  has  an  effect  on  a  process  (motion  correspondence)  that 
pure vision orthodoxy considers "early." In this context it is appropriate to
mention also influence in the other direction: that of vision on hearing. Seeing
the speaker's lips move has a significant effect on auditory perception, as
has been especially well documented in the McGurk effect.

4.  Motion  Correspondence  and  the  Role  of  Image  Segmentation.  Figure  2.5 
shows  two  frames  of  a  movie  in  which  the  first  frame  has  four  Pacmen  on 
the  left,  and  the  second  has  four  Pacmen  on  the  right.  In  the  movie,  the 
frames  are  alternated,  and  the  disks  are  in  perfect  registration  from  one 
frame  to  the  next.  What  observers  report  seeing  is  a  foreground  opaque 
square  shifting  left  and  right,  occluding  and  revealing  the  four  black  disks 
in the background. Subjects never report seeing Pacmen opening and
closing  their  mouths;  they  never  report  seeing  illusory  squares  flashing 
off  and  on.  Moreover,  when  a  template  of  this  movie  was  then  projected 
on  a  regular  grid  of  dots,  the  dots  inside  the  subjective  square  appeared  to 
move  with  the  illusory  surface  even  though  they  were  physically  stationary 
(figure  2.6).  "Outside"  dots  did  not  move  (Ramachandran  1985). 

These experiments imply that the human visual system does not always
solve the correspondence problem independently of the segmentation
problem (the problem of what features are parts belonging to the same
thing), though pure visionaries tend to expect that solving segmentation is
a late process that kicks in after the correspondence problem is solved.
Subjects' overwhelming preference for the "occluding square" interpretation
over the "yapping Pacmen" interpretation indicates that the solution to the

Figure 2.6 When dots are added to the background of figure 2.5, those dots internal to
the occluding square appear to move with it when it occludes the right-side circles. The
background dots, however, appear stationary. (Based on Ramachandran 1985)


segmentation  problem  itself  involves  large-scale  effects  dominating  over 
local  constraints.  If  seeing  motion  in  this  experiment  depended  on  solving 
the  correspondence  problem  at  the  local  level,  then  presumably  yapping 
Pacmen  would  be  seen.  The  experiment  indicates  that  what  are  matched 
between  frames  are  the  larger  scale  and  salient  features;  the  smaller  scale 
features  are  pulled  along  with  the  global  decision. 

Are the foregoing examples really significant? A pooh-poohing strategy
may downplay the effects as minor departures ("biology will be biology").
To be sure, a theory can always accommodate any given "anomaly" by
making some corrective adjustment or other. Nevertheless, as anomalies
accumulate, what passed as corrective adjustments may come to be
deplored as ad hoc theory-savers. A phenomenon is an anomaly only relative
to a background theory, and if the history of science teaches us anything,
it is that one theory's anomaly is another theory's prototypical case. Thus
"retrograde motion" of the planets was an anomaly for geocentric
cosmologists but a typical instance for Galileo; the perihelion advance of Mercury
was an anomaly for Newtonian physics, but a typical instance for
Einsteinian physics. Any single anomaly on its own may not be enough to
switch  investment  to  a  new  theoretical  framework.  The  cumulative  effect 
of  an  assortment  of  anomalies,  however,  is  another  matter. 

Can Semantic Categorization Affect Shape-from-Shading? Helmholtz
observed that a hollow mask presented from the "inside" (the concave
view, with the nose extending away from the observer) about 2 m from
the observer is invariably perceived as a convex mask with features
protruding (nose coming toward the observer). In more recent experiments,
Ramachandran (1988) found that the concave mask continues to be seen
as convex even when it is illuminated from below, a condition that often
suffices to reverse a perception of convexity to one of concavity. This
remains true even when the subject is informed about the direction of
illumination of the mask. Perceptual persistence of the concave mask as a
convex mask shows a strong top-down effect on an allegedly early visual
task, namely determining shape from shading.

Does  this  perceptual  reversal  of  the  hollow  mask  result  from  a  generic 
assumption  that  many  objects  of  interest  (nuts,  rocks,  berries,  fists,  breasts) 
are  usually  convex  or  that  faces  in  particular  are  typically  convex?  That 
is,  does  the  categorization  of  the  image  as  a  face  override  the  shading  cues 
such  that  the  reversal  is  a  very  strong  effect?  To  address  this  question, 
Ramachandran, Gregory, and Maddock (unpublished observations) presented
subjects with two masks: one right side up and the other upside
down. Upside-down faces are often poorly analyzed with respect to
features, and an upside-down mask may not be seen as having facial features
at all. In any case, upright faces are what we normally encounter. In the
experiment, subjects walk slowly backward away from the pair of stimuli,
starting at 0.5 m, moving to 5.0 m. At a close distance of about 0.5 m
subjects correctly see both hollow masks as hollow (concave). At about 1 m,
subjects usually see the upright mask as convex; the upside-down mask,
however, is still seen as concave until viewing distance is about 1.5-2.0 m,
whereupon subjects tend to see it too as convex. The stimuli are identical
save  for  orientation,  yet  one  is  seen  as  concave  and  the  other  as  convex. 
Hence  this  experiment  convincingly  illustrates  that  an  allegedly  "later" 
process  (face  categorization)  has  an  effect  on  an  allegedly  "earlier"  process 
(the  shading  predicts  thus  and  such  curvatures)  (figure  2.7). 

Can  Subjective  Contours  Affect  Stereoscopic  Depth  Perception?  Stereo 
vision  has  been  cited  (Poggio  et  al.  1985)  as  an  early  vision  task,  one  that 
is accomplished by an autonomous module prior to solving segmentation
and classification. That we can fuse Julesz random dot stereograms
to see figures in depth is evidence for the idea that matching for stereo
can be accomplished with matching of local features only, independently
of global properties devolving from segmentation or categorization
decisions. While the Julesz stereogram is indeed a stunning phenomenon, the
correspondence problem it presents is entirely atypical of the correspondence
problem in the real world. The logical point here should be spelled
out: "Not always dependent on a" does not imply "Not standardly
dependent on a," let alone, "Never dependent on a." Hence the question
remains  whether  in  typical  real  world  conditions,  stereo  vision  might  in 
fact  make  use  of  top-down,  global  information.  To  determine  whether 
under  some  conditions  the  segmentation  data  might  be  used  in  solving 
the correspondence problem, Ramachandran (1986) designed stereo pairs

Figure 2.7 The hollow mask, photographed from its concave orientation (as though you are
about to put it on). In (A) the light comes from above; in (B) the light comes from both sides.


where  the  feature  that  must  be  matched  to  see  stereoptic  depth  is  some 
high-level property. The choice was subjective contours, allegedly the
result of "later" processing (figure 2.8).

In  the  monocular  viewing  condition,  illusory  contours  can  be  seen  in 
any  of  the  four  displays  (above).  The  top  pair  can  be  stereoscopically 
fused so that one sees a striped square standing well in front of a
background consisting of black circles on a striped mat. The bottom pair can
also be stereoptically fused. Here one sees four holes in the striped
foreground mat, and through the holes, well behind the striped mat, one sees
a partially occluded striped square on a black background. These are
especially surprising results, because the stripes of the perceived foreground
and  the  perceived  background  are  at  zero  disparity.  The  only  disparity 
that  exists  on  which  the  brain  can  base  a  stereo  depth  perception  comes 
from  the  subjective  contour. 

According  to  pure  vision  orthodoxy,  perceiving  subjective  contours  is 
a  "later"  effect  requiring  global  integration,  in  contrast  to  finding  stereo 
correspondences  for  depth,  which  is  considered  an  "earlier"  effect.  This 
result,  however,  appears  to  be  an  example  of  "later"  influencing — in  fact 
enabling — "earlier."  It  should  also  be  emphasized  that  the  emergence  of 
qualitatively  different  percepts  (lined  square  in  front  of  disks  versus  lined 
square behind portholes) cannot be accounted for by any existing stereo
algorithms that standardly predict a reversal in sign of perceived depth only

Figure 2.8 By fusing the upper stereo pair, one sees a striped square standing well in front of
a background consisting of black circles on a striped mat. By fusing the bottom pair, one sees
four holes in the striped foreground mat, and through the holes, a partially occluded striped
square on a black background. (This assumes fusion by divergence. The opposite order is
available to those who fuse by convergence.) In both cases, the stripes are at zero disparity.
(Based on Ramachandran 1986)


if  the  disparities  are  reversed.  At  the  risk  of  repetition,  we  note  again  that  in 
figure  2.8  (top  and  bottom),  the  lines  are  at  zero  disparity  (Ramachandran 
1986;  Nakayama  and  Shimojo  1992). 

Can Shape Recognition Affect Figure-Ground Relationships? Figure-ground
identification is generally thought to precede shape recognition,
but  recent  experiments  using  the  Rubin  vase/faces  stimulus  demonstrate 
that  shape  recognition  can  contribute  to  the  identification  of  figure-ground 
(Peterson  and  Gibson,  1991). 

Does the discovery of cells in V1 and V2 that respond to subjective
contours (see below, p. 45) mean that detecting subjective contours is an early
achievement after all? Not necessarily. The known physiological facts are
consistent both with the "early effects" possibility and with a "later
effect backsignaled" possibility. Further neurobiological and modeling
experiments will help answer which possibility is realized in the nervous
system.

Visual  Attention 

An  hypothesis  of  interactive  vision  claims  that  the  brain  probably  does  not 
create and maintain a visual world representation that corresponds
detail-by-detail to the visual world itself. For one thing, it need not, since the
world itself is highly stable and conveniently "out there" to be sampled
and resampled. On any given fixation, the brain can well make do with
a partially elaborated representation of the world (O'Regan 1992; Ballard
1991;  Dennett  1992).  As  O'Regan  (1992)  puts  it,  "the  visual  environment 
functions  as  a  sort  of  outside  memory  store." 

For another thing, as some data presented below suggest, the brain
probably does not create and maintain a picture-perfect world representation.
We conjecture that the undeniable feeling of having whole scene visual
representation is the result mainly of (1) repeated visual visits to stimuli in
the scene, (2) short-term semantic memory on the order of a few seconds
that maintains the general sense of what is going on without creating and
maintaining the point-by-point detail, (3) the brain's "objectification" of
sensory perception such that a signal processed in cortex is represented as
being about an object in space, i.e., feeling a burn on the hand, seeing a
skunk in the grass, hearing a train approaching from the north, etc., and
(4) the predictive dimension of pattern recognition, i.e., recognizing
something as a burning log involves recognizing that it will burn my hand if I
touch it, that smoky smells are produced, that water will quench the fire,
that  sand  will  smother  it,  that  meat  tastes  better  when  browned  on  it,  that 
the  fire  will  go  out  after  a  while,  and  so  on  and  on. 

Evidence supporting the "partial-representation per glimpse" or
"semi-world" hypothesis derives from research using on-line computer control to
change  what  is  visible  on  a  computer  display  as  a  function  of  the  subject's 
eye  movements.  When  major  display  changes  are  made  during  saccades, 
those  changes  are  rarely  noticed,  even  when  they  involve  bold  alterations 
of  color  of  whole  objects,  or  when  the  changes  consist  in  removal,  shifting 
about,  or  addition  of  objects  such  as  cars,  hats,  trees,  and  people  (McConkie 
1990).  The  exception  is  when  the  subject  is  explicitly  paying  attention  to  a 
certain  feature,  watching  for  a  change. 

Many careful studies using text-reading tasks elegantly support the
"partial-representation per glimpse" hypothesis. These studies use a
"moving window paradigm" in which subjects read a line of text that contains a
window of normal text surrounded fore and aft by "junk" text. As readers
move their eyes along the line, the window moves with the eyes
(McConkie and Rayner 1975; Rayner et al. 1980; O'Regan 1990) (figure 2.9).
The strategy is to discover the spatial extent of the zone from which useful
information is extracted on a given fixation by varying the size of the
window and testing using reading rate and comprehension measures. This
zone is called the "perceptual" or "attentional" span. If at a given window
width reading rate or comprehension declines from a reader's baseline,
it is presumed that surrounding junk text has affected reading, and hence
that reader's attentional span is wider than the size of the window. By
finding the smallest width at which reading is unaffected, a reader's attentional
span can be quite precisely calibrated.
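
The display manipulation described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the software used in the cited studies: the function name `moving_window`, its default window widths, the use of "x" as junk, and the decision to leave spaces intact are all choices made for this sketch.

```python
def moving_window(line, fixation, left=3, right=15, junk="x"):
    """Return the line of text as displayed for a single fixation.

    Characters from `fixation - left` through `fixation + right`
    (inclusive) are shown normally; every other non-space character is
    replaced by `junk`. Preserving spaces is an assumption of this sketch.
    """
    lo = max(0, fixation - left)
    hi = min(len(line), fixation + right + 1)
    shown = [
        ch if (lo <= i < hi or ch == " ") else junk
        for i, ch in enumerate(line)
    ]
    return "".join(shown)


text = "This sentence shows the nature of the perceptual span."
# Fixation on the "o" of "shows" (character index 16).
print(moving_window(text, fixation=16))
```

In an experiment of this kind the function would be re-run on every fixation reported by the eye tracker, and the window widths would be varied across trials to find the smallest window at which reading rate and comprehension stay at baseline.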

In typical subjects, reading text the size you are now reading, the
attentional span is about 17-18 characters in width, and it is asymmetric about
the point of fixation, with about 2-3 characters to the left of fixation and
about 15 characters to the right. On the other hand, should you be reading

This sentence shows the nature of the perceptual span.
* 


xxxxxxxxxxx  shows  the  nature  xxxxxxxxxxxxxxxxxxxx 
* 


Maximum  Perceptual  Span 

2-3  character  spaces  left  (beginning  of  current  word). 

15  character  spaces  right  (2  words  beyond  current  word). 

Figure  2.9  The  attentional  ("perceptual")  span  is  defined  as  that  zone  from  which  useful 
information  can  be  extracted  on  a  given  fixation.  Fixation  point  is  indicated  by  an  asterisk. 
This  displays  the  width  of  the  attentional  span  and  the  asymmetry  of  the  span  (Courtesy  John 
Henderson) 

Hebrew instead of English, and hence traveling from page right to page left,
the attentional span will be about 2-3 characters to the right and 15 to the left
(Pollatsek et al. 1981), or reading Japanese, in which case it is asymmetric
in the vertical dimension (Osaka and Oda 1991). This means that subjects
read as well when junk text surrounds the 17-18 character span as when
the whole line is visible, but read less well if the window is narrowed to 14
or 12 characters. At 17-18 character window width, the surrounding junk
text  is  simply  never  noticed.  Interestingly,  it  remains  entirely  unnoticed 
even  when  the  reading  subject  is  told  that  the  moving  window  paradigm 
is  running  (McConkie  1979;  O'Regan  1990;  Henderson  1992). 

Further experiments using this paradigm indicate that a shift in visual
attention precedes saccadic eye movement to a particular location,
presumably guiding it to a location that low-level analysis deems the next
pretty good landing spot (Henderson et al. 1989). Henderson (1993)
proposes that visual attention binds; inter alia, it binds the visual stimulus to
a  spatial  location  to  enable  a  visuo-motor  representation  that  guides  the 
next  motor  response.  When  the  fovea  has  landed,  some  features  are  seen. 

Experiments along very different lines suggest that the information
capacity of attention per glimpse is too small to contain a richly detailed
whole-scene icon. Verghese and Pelli (1992) report results concerning the
amount of information an observer's attention can handle. Based on their
results, they conclude that the capacity of the attention mechanism is
limited to about 44 ± 15 bits per glimpse. Preattentive mechanisms (studied by
Treisman  and  by  Julesz)  presumably  operate  first,  and  operate  in  parallel. 
Verghese  and  Pelli  calculated  that  the  preattentive  information  capacity  is 
much  greater — about  2106  bits.  The  attentional  mechanism,  in  contrast  to 
the preattentive mechanism, they believe to be low capacity. (Verghese
and Pelli define a preattentive task as "one in which the probability of
detecting the target is independent of the number of distracter elements" and
an attentive task as one in which "the probability of detecting the target is
inversely proportional to the number of elements in the display" [p. 983].)

Verghese  and  Pelli  ran  two  subjects  on  a  number  of  attention  tasks  of 
varying  difficulty,  and  compared  results  across  tasks.  In  a  paradigm  they 
call "finding the dead fly," subjects are required to detect the single
stationary spot among moving spots. The complementary task of finding the live
fly — the  moving  object  among  stationary  objects — is  a  preattentive  task  in 
which  the  target  "pops  out."  They  note  that  their  calculation  of  44  ±  15  bits 
is  consistent  with  Sperling's  (1960)  estimate  of  40  bits  for  the  iconic  store. 
In  Sperling's  technique,  an  array  of  letters  was  flashed  to  the  observer.  He 
found  that  subjects  could  report  only  part  of  the  display,  roughly  9  letters 
(=  41  bits). 
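
The conversion from reported letters to bits can be checked with a short calculation. The sketch below assumes that each letter is drawn independently and with equal probability from a 26-letter alphabet; that assumption is ours, and the exact figure depends on the letter set actually used in the experiment, which may account for the small difference from the 41 bits quoted above.

```python
import math


def letters_to_bits(n_letters, alphabet_size=26):
    """Information in n independent, equiprobable letters, in bits.

    The alphabet size of 26 is an assumption of this sketch; a smaller
    letter set gives a proportionally smaller estimate.
    """
    return n_letters * math.log2(alphabet_size)


print(round(letters_to_bits(9), 1))
```

Nine letters come to roughly 42 bits under this assumption, which is in the same range as both Sperling's estimate and the 44 ± 15 bits reported by Verghese and Pelli.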

There are important dependencies between visual attention, visual
perception, and iconic memory. To a first approximation: (1) if you are not
visually attending to a then you do not see a (have a visual experience of a),
and (2) if you are not attending to a and you do not have a visual experience
of a, then you do not have iconic memory for a. Given the limited capacity
of visual attention, these assumptions imply that the informational
capacity of visual perception (in the rough and ready sense of "literally seeing")
is approximately as small (see also Crick and Koch 1990b).

Nevertheless,  some  motor  behavior — and  goal-directed  eye  movement 
in  particular — apparently  does  not  require  conscious  perception  of  the  item 
to which the movement is directed, but does require some attentional
scanning and some parafoveal signals that presumably provide coarse, easy-to-extract
visual cues. During reading a saccade often "lands" the fovea near
the third letter of the word (close enough to the "optimal" viewing
position of the word), and small correction saccades are made when this is not
satisfactory.  This  implies  that  the  eyes  are  aiming  at  a  target,  and  hence 
that  at  least  crude  visual  processing  has  guided  the  saccade  (McConkie  et 
al.  1988;  Rayner  et  al.  1983;  Kapoula  1984). 

In  concluding  this  section,  we  emphatically  note  that  what  we  have 
discussed  here  is  only  a  small  part  of  the  story  since,  as  Schall  (1991)  points 
out,  orienting  to  a  stimulus  often  involves  more  than  eye  movements.  It 
often  also  involves  head  and  whole  body  movements. 

Considerations  from  Neuroanatomy 

The received wisdom concerning visual processing envisages information
flowing from stage to stage in the hierarchy until it reaches the highest stage,
at  which  point  the  brain  has  a  fully  elaborated  world  model,  ready  for 
motor  consideration.  In  this  section,  we  shall  draw  attention  to,  though  not 
fully  discuss,  some  connectivity  that  is  consistent  with  a  loose,  interactive 
hierarchy  but  casts  doubt  on  the  notion  of  a  strict  hierarchy.  We  do  of  course 
acknowledge  that  so  far  these  data  provide  only  suggestive  signs  that  the 
interactive framework is preferable. (For related ideas based on
backprojection data in the context of neuropsychological data, see Damasio
1989b and Van Hoesen 1993.)

Backprojections (Corticocortical) Typically in monkeys, forward axon
projections (from regions closer in synaptic distance to the sensory periphery
to regions more synaptically distant; e.g., V2 to V4) are equivalent to or
outnumbered  by  projections  back  (Rockland  and  Pandya  1979;  Rockland 
and  Virga  1989;  Van  Essen  and  Maunsell  1983;  Van  Essen  and  Anderson 
1990).  The  reciprocity  of  many  of  these  projections  (P  to  Q  and  Q  to  P)  has 
been  documented  in  many  areas,  including  connections  back  to  the  LGN 
(figure  2.10).  It  has  begun  to  emerge  that  some  backprojections,  however, 
are  not  merely  reciprocating  feedforward  connections,  but  appear  to  be 
widely  distributed,  including  distribution  to  some  areas  from  which  they 
do  not  receive  projections.  Thus  Rockland  reports  (1992a,b)  injection  data 
showing  that  some  axons  from  area  TE  do  indeed  project  reciprocally  to 
V4, but sparser projections were also seen to V2 (mostly layer 1, but some
in 2 and 5) and V1 (layer 1). These TE axons originated mainly in layers 6
and  3a  (figure  2.11). 

Diffuse  Ascending  Systems  In  addition  to  the  inputs  that  pass  through 
the  thalamus  to  the  cortex,  there  are  a  number  of  afferent  systems  that 
arise  in  small  nuclei  located  in  the  brainstem  and  basal  forebrain.  These 
systems  include  the  locus  coeruleus,  whose  noradrenergic  axons  course 
widely  throughout  the  cortical  mantle,  the  serotonergic  raphe  nuclei,  the 
ventral  tegmental  area,  which  sends  dopamine  projections  to  the  frontal 
cortex,  and  cholinergic  inputs  emanating  from  various  nuclei,  including 
the  nucleus  basalis  of  Meynert.  These  systems  are  important  for  arousal,  for 
they  control  the  transition  from  sleep  to  wakefulness.  They  also  provide  the 
cortex  with  information  about  the  reward  value  (dopamine)  and  salience 
(noradrenaline)  of  sensory  stimuli.  Another  cortical  input  arises  from  the 
amygdala,  which  conveys  information  about  the  affective  value  of  sensory 
stimuli  to  the  cortex,  primarily  to  the  upper  layers.  Possible  computational 
utility for these diffuse ascending systems will be presented later.

Corticothalamic Connections Sensory inputs from the specific modalities
project from the thalamus to the middle layers (mainly layer 4) of the
cortex.  Reciprocal  connections  from  each  cortical  area,  mainly  originating 
in  deep  layers,  project  back  to  the  thalamus.  In  visual  cortex  of  the  cat  it  is 

Figure 2.10 (Top) Schematic diagram of some of the cortical visual areas and their
connections in the macaque monkey. Solid lines indicate projections involving all portions of the
visual field representation in an area; dotted lines indicate projections limited to the
representation of the peripheral field. Heavy arrowheads indicate forward projections; light
arrowheads indicate backward projections. (Reprinted with permission from Desimone and
Ungerleider 1989) (Bottom) Laminar patterns of cortical connectivity used for making
"forward" and "backward" assignments. Three characteristic patterns of termination are
indicated in the central column. These include preferential termination in layer 4 (the F pattern),
a columnar (C) pattern involving approximately equal density of termination in all layers,
and a multilaminar (M) pattern that preferentially avoids layer 4. There are also
characteristic patterns for cells of origin in different pathways. Filled ovals, cell bodies; angles, axon
terminals. (Reprinted with permission from Felleman and Van Essen 1991)

known that the V1 projections back to the LGN of the thalamus outnumber
thalamocortical projections by about 10:1.

Corticofugal  projections  have  collaterals  in  the  reticular  nucleus  of  the 
thalamus.  The  reticular  nucleus  of  the  thalamus  is  a  sheet  of  inhibitory 
neurons, reminiscent of the skin of a peach. Both corticothalamic axons and
thalamocortical projection neurons have excitatory connections on
these inhibitory neurons, whose output is primarily back to the thalamus.
The  precise  function  of  the  reticular  nucleus  remains  to  be  discovered,  but 
it  does  have  a  central  role  in  organizing  sleep  rhythms,  such  as  spindling 
and  delta  waves  in  deep  sleep  (Steriade  et  al.  1993b). 

Connections from Visual Cortical Areas to Motor Structures  In the cat,
twenty-five cortical areas project to the superior colliculus (SC) (Harting et
al. 1992). These include areas 17, 18, 19, 20a, 20b, 21a, and 21b. Harting et
al. (1992) found that the corticotectal projections from areas 17 and 18 terminate
exclusively in the superficial layers, while those from the remaining 23 areas
terminate more promiscuously (figure 2.12). The SC has an important role in
directing  saccadic  eye  movements,  and,  in  animals  with  orientable  ears, 
ear  movements. 

Nearly  every  area  of  mammalian  cortex  has  some  projections  to  the  stria¬ 
tum,  with  some  topological  preservation.  Although  the  functions  of  the 
striatum  are  not  well  understood,  the  correlation  between  striatal  lesions 
and  severe  motor  impairments  is  well  known,  and  it  is  likely  that  the  stria¬ 
tum  has  an  important  role  in  integrating  sequences  of  movements.  Lesion 
studies  also  indicate  that  some  parts  of  the  striatum  are  relevant  to  produc¬ 
ing  voluntary  eye  movements,  as  opposed  to  sensory-driven  or  reflex  eye 
movements.  It  appears  that  the  striatum  can  veto  some  reflexive  responses 
via  an  inhibitory  effect  on  motor  structures,  whereas  voluntary  movements 
are  facilitated  by  disinhibitory  striatal  output  to  motor  structures. 

What  is  frustrating  about  this  assembly  of  data,  as  with  neuroanatomy 
generally,  is  that  we  do  not  really  know  what  it  all  means.  The  number 
of  neurons  and  connections  is  bewildering,  and  the  significance  of  pro¬ 
jections  to  one  place  or  another,  of  distinct  cell  populations,  and  so  on, 
is typically puzzling. (See Young 1992 for a useful strategy for clarifying
the  significance.)  Neuroanatomy  is,  nonetheless,  the  observational  hard- 


Figure  2.11  Schematic  diagram  of  the  feedforward  connections  (solid  lines)  and  backprojec- 
tions  (broken  lines)  in  the  monkey.  What  is  especially  striking  is  that  fibers  from  visual  cortical 
areas  TE  (inferior  temporal  cortex)  and  TEO  (posterior  to  TE  and  anterior  to  V4)  project  all 
the way back to V2 and V1. (Based on Rockland et al. 1992.)

Figure 2.12 Summary diagram in the sagittal plane of the superior colliculus (SC) showing
the  laminar  and  sublaminar  distribution  of  axons  from  cortical  areas  to  the  SC  in  the  cat,  as 
labeled  above  each  sector.  (Reprinted  with  permission  from  Harting  et  al.  1992) 

pan  for  neuroscience,  and  the  data  can  be  provocative  even  when  they  are 
not  self-explanatory.  The  prevalence  and  systematic  character  of  feedback 
loops  are  particularly  provocative,  at  least  because  such  loops  signify  that 
the  system  is  dynamic — that  it  has  time-dependent  properties.  Output 
loops back to affect new inputs, and it is possible for higher areas to
affect inputs of lower areas. The time delays will matter enormously in
determining what capacities the system displays.

The  second  point  is  that  all  cortical  visual  areas,  from  the  lowest  to 
the  highest,  have  numerous  projections  to  lower  brain  centers,  including 
motor-relevant  areas  such  as  the  striatum,  superior  colliculus,  and  cere¬ 
bellum.  The  anatomy  is  consistent  with  the  idea  that  motor  assembly 
can  begin  even  before  sensory  signals  reach  the  highest  levels.  Especially 
for  skilled  actions  performed  in  a  familiar  context,  such  as  reading  aloud, 
shooting  a  basket,  and  hunting  prey,  this  seems  reasonable.  Are  the  only 
movements  at  issue  here  eye  movements?  Probably  not.  Distinguish¬ 
ing  gaze-related  movements  from  extra-gaze  movements  is  anything  but 
straightforward,  for  the  eyes  are  in  the  head,  and  the  head  is  attached  to 
the  rest  of  the  body.  Foveating  an  object,  for  example,  may  well  involve 
movement of the eyes, head, and neck, and on occasion, the entire body.
Watching  Michael  Jordan  play  basketball  or  a  group  of  ravens  steal  a  cari¬ 
bou  corpse  from  a  wolf  tends  to  underscore  the  integrated,  whole-body 
character  of  visuomotor  coordination. 

Considerations  from  Neurophysiology 

In  keeping  with  the  foregoing  section,  this  section  is  suggestive  rather  than 
definitive.  It  is  also  a  bit  of  a  fact  salad,  since  at  this  stage  the  evidence 
does  not  fit  together  into  a  tight  story  of  how  interactive  vision  works. 
Such unity as does exist is the result of the data's constituting evidence
for various interactions between so-called "higher" and "lower" stages of
the visual system, and between the visual and nonvisual systems (see also
Goldman-Rakic 1988; Van Hoesen 1993).

Connections  from  Motor  Structures  to  Visual  Cortex  Belying  the  as¬ 
sumption  that  the  representation  of  the  visual  scene  is  innocent  of  nonvi¬ 
sual  information,  certain  physiological  data  show  interactive  effects  even 
at very early stages of visual cortex. For example, the spontaneous activity
of V1 neurons is suppressed according to the onset time of saccades.
The suppression begins about 20-30 msec after the saccade is initiated,
and lasts about 200 msec (Duffy and Burchfield 1975). The suppression
can be accomplished only by using oculomotor signals, perhaps efference
copy, and hence this effect supports the interactive hypothesis. Neurons
sensitive to eye position have been found in the LGN (Lal and Friedlander
1989), visual cortical area V1 (Trotter et al. 1992; Weyand and Malpeli
1989), and V3 (Galletti and Battaglini 1989). Given the existence and causal
efficacy of various nonvisual V1 signals, Pouget et al. (1993) hypothesized
that  visual  features  are  encoded  in  egocentric  (spatiotopic)  coordinates  at 
early  stages  of  visual  processing,  and  that  eye-position  information  is  used 
in  computing  where  in  egocentric  space  the  stimulus  is  located.  Their  net¬ 
work  model  demonstrates  the  feasibility  of  such  a  computation  when  the 
network  takes  as  input  both  retinal  and  eye-position  signals. 
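
The general idea of combining retinal and eye-position signals can be illustrated with a toy calculation. The following Python sketch is a minimal illustration only, and it is not the network of Pouget et al. (1993): the function names, the preferred values, the tuning width, and the hand-wired readout are all assumptions chosen for simplicity. Each hidden unit multiplies a Gaussian retinal tuning curve by a Gaussian eye-position tuning curve, and egocentric direction is read out as the activity-weighted average of the units' summed preferred values.

```python
import math

# Hypothetical preferred values, in degrees of visual angle.
RETINAL_CENTRES = list(range(-60, 61, 5))
EYE_CENTRES = list(range(-30, 31, 5))


def gaussian(x, centre, width):
    """Bell-shaped tuning curve."""
    return math.exp(-0.5 * ((x - centre) / width) ** 2)


def hidden_layer(retinal_deg, eye_deg, width=5.0):
    """Each unit multiplies retinal tuning by eye-position tuning."""
    return {
        (r, e): gaussian(retinal_deg, r, width) * gaussian(eye_deg, e, width)
        for r in RETINAL_CENTRES
        for e in EYE_CENTRES
    }


def egocentric_estimate(retinal_deg, eye_deg):
    """Read out head-centred direction as an activity-weighted average of r + e."""
    units = hidden_layer(retinal_deg, eye_deg)
    total = sum(units.values())
    return sum((r + e) * activity for (r, e), activity in units.items()) / total


# The same head-centred direction reached with two different eye positions.
print(egocentric_estimate(10.0, -5.0))
print(egocentric_estimate(-5.0, 10.0))
```

The two calls place the stimulus at different retinal locations, yet the readout assigns them approximately the same egocentric direction, which is the sense in which eye-position information makes an early spatiotopic code feasible. A trained network would learn its readout weights rather than have them set by hand as they are here.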

Consider also that a few V1 cells and a higher percentage of V2 cells show
an  enhanced  response  to  a  target  to  which  a  saccade  is  about  to  be  made 
(Wurtz  and  Mohler  1976).  Again,  these  data  indicate  some  influence  of 
motor  system  signals,  specifically  motor  planning  signals,  on  cells  in  early 
visual  processing.  As  further  evidence,  note  that  some  neurons  in  V3A 
show  variable  response  as  a  function  of  the  angle  of  gaze;  response  was 
enhanced  when  gaze  was  directed  to  the  contralateral  hemifield  (Galletti 
and  Battaglini  1989). 

Inferior  Parietal  Cortex  and  Eye  Position  Caudal  inferior  parietal  cortex 
(IPL)  has  two  major  subdivisions:  LIP  and  7a  (figure  2.13).  LIP  is  directly 
connected to the superior colliculus and the frontal eye fields. Area 7a has a different
connectivity: mainly with polymodal, limbic, and some prestriate cortex.
Figure 2.13 Parcellation of inferior parietal lobule and adjoining dorsal aspect of the prelunate
gyrus. The cortical areas are represented on flattened reconstructions of the cortex. (A)
Lateral  view  of  the  monkey  hemisphere.  The  darker  line  indicates  the  area  to  be  flattened. 
(B)  The  same  cortex  isolated  from  the  rest  of  the  brain.  The  stippled  areas  are  cortex  buried  in 
sulci,  and  the  blackened  area  is  the  floor  of  the  superior  temporal  sulcus.  The  arrows  indicate 
movement  of  local  cortical  regions  resulting  from  mechanical  flattening.  (C)  The  completely 
flattened  representation  of  the  same  area.  The  stippled  areas  represent  cortical  regions  buried 
in  sulci  and  the  contourlike  lines  are  tracings  of  layer  IV  taken  from  frontal  sections  through 
this  area.  (D)  Locations  of  several  of  the  cortical  fields.  The  dotted  lines  indicate  borders  of 
cortical  fields  that  are  not  precisely  determinable.  IPL,  inferior  parietal  lobule;  IPS,  intrapari- 
etal  sulcus;  LIP,  lateral  intraparietal  region;  STS,  superior  temporal  sulcus.  (Reprinted  with 
permission  from  Andersen  1987). 

LIP responses are correlated with execution of saccadic eye movements;
area 7a cells respond to a stimulus at a certain retinal location, but their
response is modulated by the position of the eye in the head (Zipser and Andersen 1988).
Hardy  and  Lynch  (1992)  report  that  both  LIP  and  area  7a  receive  the  ma¬ 
jority  of  their  thalamic  inputs  from  distinct  patches  in  the  medial  pulvinar 
nucleus  (figure  2.14). 

Illusory  Contours  and  Figure  Ground  In  1984  von  der  Heydt  et  al.  re¬ 
ported  that  neurons  in  visual  area  V2  of  the  macaque  will  respond  to  il- 


Figure  2.14  Diagram  of  sections  of  the  thalamus  showing  the  distribution  of  retrogradely 
labeled  neurons  within  the  thalamus  resulting  from  inferior  parietal  lobule  (IPL)  injections  in 
parietal  areas  7a  (open  circles)  and  lateral  intraparietal  region  (LIP)  (filled  circles).  A  single 
injection (1.2 μl) of the fluorescent dye DY (diamidino-dihydrochloride yellow) was made in
7a and a single injection (0.5 μl) of fast blue in LIP. Each individual symbol denotes a single
labeled neuron. The densest labeling is in medial pulvinar nucleus of the thalamus (PM), with
LIP  and  7a  showing  a  distinct  projection  pattern.  BrSC,  brachium  of  the  superior  colliculus; 
Cd,  caudate  nucleus;  Cm/Pf,  centromedian  and  parafascicular  nuclei;  GM,  medial  geniculate 
nucleus,  pars  parvicellularis;  H,  habenula;  Lim,  nucleus  limitans;  MD,  mediodorsal  nucleus; 
PAG,  periaqueductal  gray;  PL,  lateral  pulvinar  nucleus;  Ret,  reticular  nucleus.  (Reprinted 
with  permission  from  Hardy  and  Lynch  1992) 

lusory contours. More recently, Grosof et al. (1992) reported that some
orientation-selective cells in V1 respond to a class of illusory contours. As
noted earlier, these data are consistent with the possibility that low-level
response depends on higher-level operations whose results are backprojected
to lower levels. This is lent plausibility by the fact that detection of
illusory contours will depend on previous operations involving interpolation
across a span of the visual field.

Neurons  in  area  MT  respond  selectively  to  direction  of  motion  but  not 
to  wavelength.  Nonetheless,  color  can  have  a  major  effect  on  how  these 
cells  respond  by  virtue  of  how  the  visual  stimulus  is  segmented  (Dobkins 
and  Albright  1993). 

Cross-Modal  Interactions  The  responses  of  cells  in  V4  to  a  visual  stim¬ 
ulus  can  be  modified  by  somatosensory  stimuli  (Maunsell  et  al.  1991). 
Fuster  (1990)  has  shown  similar  task-dependent  modifications  for  cells  in 
somatosensory  cortex,  area  SI. 

Dynamic  Mapping  in  Exotropia 

In  this  section  we  discuss  an  ophthalmic  phenomenon  observed  in  human 
subjects. By conventional lights, this is a truly surprising phenomenon, and it
seems to demonstrate that processing as early as V1 can be influenced by
top-down factors. The phenomenon has not been well studied, to say the
least, and much more investigation is required. Nevertheless, we mention
it here partly because it is intriguing, but mainly because if the description
below is accurate, then we must rethink Pure Vision's conventional
assumptions about the Receptive Field.

Exotropia is a form of squint in which both eyes are used when fixating
small objects close by (e.g., 12 in. from the nose) but when looking at
distant  objects,  the  squinting  eye  deviates  outward  by  as  much  as  45°  to  60°. 
Curiously,  the  patient  does  not  experience  double  vision — the  deviating 
eye's  image  is  usually  assumed  to  be  suppressed.  It  is  not  clear,  however,  at 
what  stage  of  visual  processing  the  suppression  occurs. 

Ophthalmologists  have  claimed  that,  contrary  to  expectations,  in  a  small 
subset  of  these  patients,  fusion  occurs  not  only  during  inspection  of  near 
objects,  but  also  when  the  squinting  eye  deviates.  This  phenomenon,  called 
anomalous  retinal  correspondence  or  ARC,  has  been  reported  frequently  in  the 
literature  of  ophthalmology  and  orthoptics.  The  accuracy  of  the  reports, 
and  hence  the  existence  of  ARC,  has  not  always  been  taken  seriously,  since 
ARC  implies  a  rather  breathtaking  lability  of  receptive  fields.  Clinicians 
and  physiologists  raised  in  the  Hubel-Wiesel  tradition  usually  take  it  as 
basic  background  fact  that  (1)  binocular  connections  are  largely  established 
in  area  17  in  early  infancy  and  that  (2)  binocular  fusion  is  based  exclusively 
on  anatomical  correspondence  of  inputs  in  area  17.  For  instance,  if  a  squint 
is  surgically  induced  in  a  kitten  or  an  infant  monkey,  area  17  displays  a 
complete  loss  of  binocular  cells  (and  two  populations  of  monocular  cells) 
but  the  maps  of  the  two  eyes  never  change.  No  apparent  compensation 
such  as  anomalous  correspondence  has  been  observed  in  area  17  and  this 
has  given  rise  to  the  conviction  that  it  is  highly  improbable  that  an  ARC 
phenomenon  truly  exists. 
On the possibility that there might be more to the ARC reports, Ramachandran,
Cobb, and Valente (unpublished) recently studied two patients
who had intermittent exotropia. These patients appeared to fuse
images  both  during  near  vision  and  during  far  vision — when  the  left  eye 
deviated  outward — a  condition  called  intermittent  exotropia  with  anomalous 
correspondence.  To  determine  whether  the  patients  do  indeed  have  two  (or 
more)  separate  binocular  maps  of  the  world,  Ramachandran,  Cobb,  and 
Valenti  devised  a  procedure  that  queried  the  alignment  of  the  subject's 
afterimages  where  the  afterimage  for  the  right  eye  was  generated  inde¬ 
pendently  of  the  afterimage  for  the  left  eye.  Here  is  the  procedure:  (1) 
The  subject  (with  squint)  was  asked  to  shut  one  eye  and  to  fixate  on  the 
bottom  of  a  vertical  slit-shaped  window  mounted  on  a  flashgun.  A  flash 
was  delivered  to  generate  a  vivid  monocular  afterimage  of  the  slit.  The 
subject  was  then  asked  to  shut  this  eye  and  view  the  top  of  the  slit  with  the 
other  eye  (and  a  second  flash  was  delivered).  (2)  The  subject  opened  both 
eyes  and  viewed  a  dark  screen,  which  provided  a  uniform  background  for 
the  two  afterimages. 

The  results  were  as  follows:  (1)  The  subject  (with  squint)  reported  seeing 
afterimages  of  the  two  slits  that  were  perfectly  lined  up  with  each  other,  so 
long  as  the  subject  was  deliberately  verging  within  about  arm's  length.  (2) 
On  the  other  hand,  if  the  subject  relaxed  vergence  and  looked  at  a  distant 
wall  (such  that  the  left  eye  deviated),  the  upper  slit  (from  the  anomalous 
eye)  vividly  appeared  to  move  continuously  outward  so  that  the  two  slits 
became  misaligned  by  several  degrees.  Then  this  experiment  was  repeated 
on  two  normal  control  subjects  and  it  was  found  that  no  misalignment  of 
the  slits  occurred  for  any  ordinary  vergence  or  conjugate  eye  movements. 
Nor  could  misalignments  of  the  slits  be  produced  by  passively  displacing 
one  eyeball  in  the  normal  individuals  to  mimic  the  exotropia.  It  appears 
that  eye  position  signals  from  the  deviating  eye  selectively  influence  the 
egocentric  localization  of  points  for  that  eye  alone. 
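
The logic of the afterimage test can be made explicit with a small calculation. The Python sketch below is only an illustration under strong assumptions that the psychophysics does not establish: that the perceived direction of an afterimage is simply its retinal position plus an eye-position signal registered separately for each eye, and that this signal equals the actual deviation. The function names and the sign convention (degrees along a common horizontal axis) are hypothetical.

```python
def perceived_direction(retinal_deg, registered_eye_deg):
    """Egocentric direction assigned to a retinal point for one eye."""
    return retinal_deg + registered_eye_deg


def afterimage_misalignment(left_eye_signal_deg, right_eye_signal_deg, retinal_deg=0.0):
    """Apparent offset between two afterimages that share a retinal position.

    An afterimage is fixed on the retina, so any perceived offset must
    come from the eye-position signals rather than from the image.
    """
    left = perceived_direction(retinal_deg, left_eye_signal_deg)
    right = perceived_direction(retinal_deg, right_eye_signal_deg)
    return left - right


# Patient verging on a near target: neither eye deviates.
print(afterimage_misalignment(0.0, 0.0))

# Patient relaxing vergence: only the left eye's signal changes.
print(afterimage_misalignment(10.0, 0.0))

# Control subject, conjugate movement: both signals change together.
print(afterimage_misalignment(-8.0, -8.0))

# Control subject, passive displacement of one eyeball: the eye moves,
# but no eye-position signal is registered (assumed here to be zero).
print(afterimage_misalignment(0.0, 0.0))
```

On these assumptions a misalignment appears only when one eye's registered position changes independently of the other's, which is the pattern reported for the patients and absent in the controls.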

In  the  next  experiment,  a  light  point  was  flashed  for  150  msec  either  to  the 
right  eye  alone  or  the  left  eye  alone;  the  subject's  task  was  merely  to  point 
to  the  location  of  the  light  point.  Subjects  became  quite  skilled  at  deviating 
their  anomalous  eye  by  between  about  1°  and  40°,  and  the  afterimage  align¬ 
ment  technique  could  be  used  to  calibrate  the  deviation.  Tests  were  made 
with  deviations  between  1°  and  15°.  It  was  found  that  regardless  of  the 
degree  of  deviation  of  the  anomalous  eye,  and  regardless  of  which  eye  was 
stimulated,  subjects  made  only  marginally  more  errors  than  normal  sub¬ 
jects  in  locating  the  light  point.  Is  the  remapping  sufficiently  fine-grained 
to  support  stereopsis?  Testing  for  accuracy  of  stereoptic  judgments  us¬ 
ing  ordinary  stereograms  under  conditions  of  anomalous  eye  deviations 
between  1°  and  12°,  Ramachandran,  Cobb,  and  Valenti  found  that  dispar¬ 
ities  as  small  as  20  min  of  arc  could  be  perceived  correctly  even  though  the 
anomalous  eye  deviated  by  as  much  as  12°.  Even  when  the  half-images  of 
the  two  eyes  were  exciting  noncorresponding  retinal  points  separated  by 
12°,  very  small  retinal  disparities  could  be  detected. 
Ramachandran and his colleagues have dubbed this phenomenon dynamic
anomalous correspondence. Their results suggest that something in the
ARC reports is genuine, and they have a number of implications.

First,  binocular  correspondence  can  change  continuously  in  real  time  in  a 
single  individual  depending  on  the  degree  of  exotropia.  Hence,  binocular 
correspondence  (and  fusion)  cannot  be  based  exclusively  on  the  anatomical 
convergence  of  inputs  in  area  17.  The  relative  displacement  observed  be¬ 
tween  the  two  afterimages  also  implies  that  the  local  sign  of  retinal  points 
(and  therefore  binocular  correspondence)  must  be  continuously  updated 
as  the  eye  deviates  outward. 

Second,  since  the  two  slits  would  always  be  lined  up  as  far  as  area  17  is  con¬ 
cerned,  the  observed  misalignment  implies  that  feedback  (or  feedforward) 
signals  from  the  deviating  eye  must  somehow  be  extracted  separately  for 
each eye and must then influence the egocentric location of points selectively
for that eye alone. This is a somewhat surprising result, for it implies
that the remapping of egocentric space must be done very early, before
the eye-of-origin label is lost, that is, before the cells become completely
binocular. Since most cells anterior to area 18 (e.g., MT or V4) are symmetrically
binocular, we may conclude that the correction must involve interaction
between reafference signals and the output of cells as early as 17 or 18.

Nothing  in  the  psychophysical  results  suggests  what  the  mechanism 
might  be  by  which  these  interactions  occur.  Whatever  the  ultimate  ex¬ 
planation,  however,  the  results  do  imply  that  even  as  simple  a  perceptual 
process  as  the  localization  of  an  object  in  X/Y  coordinates  is  not  strictly  and 
absolutely  a  bottom-up  process.  Even  the  output  of  early  visual  elements — 
in  this  case  the  monocular  cells  of  area  17 — can  be  strongly  modulated  by 
back  projections  from  eye  movement  command  centers. 

If  indeed  a  complete  remapping  of  perceptual  space  can  occur  selec¬ 
tively  for  one  eye's  image  simply  in  the  interest  of  preserving  binocular 
correspondence,  this  is  a  rather  remarkable  phenomenon.  It  would  be  in¬ 
teresting  to  see  if  this  remapping  process  can  be  achieved  by  algorithms  of 
the  type  proposed  by  Zipser  and  Andersen  (1988)  for  parietal  neurons  or 
by  shifter-circuits  of  the  kind  proposed  by  Anderson  and  Van  Essen  (1987; 
see  also  chapter  13). 

COMPUTATIONAL  ADVANTAGES  OF  INTERACTIVE  VISION 

So  far  we  have  discussed  various  empirical  data  that  lend  some  credibility 
to  an  interactive-vision  approach.  But  the  further  question  is  this:  Does  it 
make  sense  computationally  for  a  nervous  system  to  have  an  interactive 
style  rather  than  a  hierarchical,  modular,  modality-pure,  and  motorically 
unadulterated  organization?  In  this  section,  we  briefly  note  four  reasons, 
based  on  the  computational  capacities  of  neural  net  models,  why  evolution 
might  have  selected  the  interactive  modus  operandi  in  nervous  systems. 
As  more  computer  models  in  the  interactive  style  are  developed  and  ex¬ 
plored,  additional  factors,  for  or  against,  may  emerge.  The  results  from 
neural net models also suggest experiments that could be run on real nervous
systems to reveal whether they are in fact computationally interactive.

Figure-Ground  Segmentation  and  Recognition  Are  More  Efficiently 
Achieved  in  Tandem  Than  Strictly  Sequentially  Segmentation  is  a  dif¬ 
ficult  task,  especially  when  there  are  many  objects  in  a  scene  partially 
occluding  one  another.  The  problem  is  essentially  that  global  information 
is  needed  to  make  decisions  at  the  local  level  concerning  what  goes  with 
what. At lower levels of processing such as V1, however, the receptive
fields  are  relatively  small  and  it  is  not  possible  locally  to  decide  which 
pieces  of  the  image  belong  together.  If  lower  levels  can  use  information 
that  is  available  at  higher  levels,  such  as  representation  of  whole  objects, 
then  feedback  connections  could  be  used  to  help  tune  lower  levels  of  pro¬ 
cessing.  This  may  sound  like  a  chicken-and-egg  proposal,  for  how  can  you 
recognize  an  object  before  you  segment  it  from  its  background?  Just  as  the 
right  answer  to  the  problem  "where  does  the  egg  come  from"  is  "an  earlier 
kind of chicken," so here the answer is "use partial segmentation to help
recognize,  and  use  partial  recognition  to  help  segment."  Indeed,  interac¬ 
tive  segmentation-recognition  may  enable  solutions  that  would  otherwise 
be  unreachable  in  short  times  by  pure  bottom-up  processing. 

It  is  worth  considering  the  performance  of  machine  reading  of  numerals. 
The  best  of  the  "pure  vision"  configured  machines  can  read  numerals  on 
credit  card  forms  only  about  60%  of  the  time.  They  do  this  well  only  because 
the  sales  slip  "exactifies"  the  data:  numerals  must  be  written  in  blue  boxes. 
This  serves  to  separate  the  numerals,  guarantee  an  exact  location,  and 
narrowly  limit  the  size.  Carver  Mead  (in  conversation)  has  pointed  out 
that  the  problem  of  efficient  machine  reading  of  zip  codes  is  essentially 
unsolved,  because  the  preprocessing  regimentations  for  numeral  entry  on 
sales  slips  do  not  exist  in  the  mail  world.  Here  the  machine  readers  have  to 
face  the  localization  problem  (where  are  the  numerals  and  in  what  order?) 
and  the  segmentation  problem  (what  does  a  squiggle  belong  to?)  as  well 
as  the  recognition  problem  (is  it  a  0  or  a  6?). 

Conventional  machines  typically  serialize  the  problem,  addressing  first 
the  segmentation  problem  and  then,  after  that  is  accomplished,  addressing 
the  recognition  problem.  Should  the  machine  missolve  or  fail  to  solve  the 
first,  the  second  is  doomed.  In  the  absence  of  strict  standardization  of 
location,  font,  size,  relation  to  other  numerals,  relation  of  zip  code  to  other 
lines,  and  so  forth,  classical  machines  regularly  fumble  the  segmentation 
problem. Unlike engineers working with the strictly serial problem design,
Carver Mead and Federico Faggin (in conversation) have found that if
networks can address segmentation and recognition in parallel, they well
outperform their serial competitors.
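
The contrast between the serial and the joint strategy can be seen in a deliberately tiny symbolic example. The Python sketch below is not the network approach of Mead and Faggin, about which the text gives no details; it is a toy under stated assumptions. The stroke alphabet, the four digit templates, and both reader functions are hypothetical. The serial reader commits to a bottom-up cut after every vertical stroke and only then tries to recognize the pieces; the joint reader chooses boundaries and labels together, so that a candidate cut survives only if what lies on either side of it can be recognized.

```python
# Hypothetical digit templates written in a made-up stroke alphabet.
TEMPLATES = {"1": "|", "0": "|_|", "4": "|-|", "7": "-|"}


def serial_read(strokes):
    """Segment first (cut after every '|'), then recognise each piece."""
    pieces, current = [], ""
    for stroke in strokes:
        current += stroke
        if stroke == "|":
            pieces.append(current)
            current = ""
    if current:
        pieces.append(current)
    lookup = {pattern: digit for digit, pattern in TEMPLATES.items()}
    return "".join(lookup.get(piece, "?") for piece in pieces)


def joint_read(strokes):
    """Choose boundaries and labels together by dynamic programming."""
    best = {0: ""}  # best[i] holds one complete reading of strokes[:i]
    for i in range(len(strokes)):
        if i not in best:
            continue  # no consistent reading ends at this position
        for digit, pattern in TEMPLATES.items():
            if strokes.startswith(pattern, i):
                best.setdefault(i + len(pattern), best[i] + digit)
    return best.get(len(strokes))  # None if nothing explains the whole input


strokes = "||_|-||"  # the digits 1, 0, 7, 1 written without gaps
print(serial_read(strokes))  # prints 11?71
print(joint_read(strokes))   # prints 1071
```

The second vertical stroke is locally ambiguous: it could be a complete 1 or the left side of a 0. The serial reader resolves the ambiguity too early and cannot recover, whereas the joint reader keeps both options open until recognition of the following strokes decides between them.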

The  processing  of  visual  motion  is  another  example  of  how  segregation 
may  proceed  in  parallel  with  visual  integration.  Consider  the  problem  of 
trying  to  track  a  bird  flying  through  branches  of  a  tree;  at  any  moment 
only  parts  of  the  bird  are  visible  through  the  occluding  foliage,  which  may 
itself be moving. The problem is to identify fleeting parts of the bird that
may  be  combined  to  estimate  the  average  velocity  of  the  bird  and  to  keep 
this  information  separate  from  information  about  the  tree.  This  is  a  global 
problem  in  that  no  small  patch  of  the  visual  field  contains  enough  infor¬ 
mation  to  unambiguously  solve  the  segregation  problem.  However,  area 
MT  of  the  primate  visual  cortex  has  neurons  that  seem  to  have  "solved" 
this  segregation  problem.  A  recent  model  of  area  MT  that  includes  two 
parallel  streams,  one  that  selects  regions  of  the  visual  field  that  contain  reli¬ 
able  motion  information,  and  another  that  integrates  information  from  that 
region,  exhibits  properties  similar  to  those  observed  in  area  MT  neurons 
(Nowlan  and  Sejnowski  1993).  This  model  demonstrates  that  segmenta¬ 
tion  and  integration  can  to  some  extent  be  performed  in  parallel  at  early 
stages  of  visual  processing. 
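
The division of labor between a selection stream and an integration stream can be sketched in a few lines. The Python code below is a loose simplification and not the Nowlan and Sejnowski (1993) model itself: the reliability scores are simply given rather than computed from the image, and the function names and numerical values are hypothetical. Selection turns the scores into normalized weights with a softmax, and integration forms the weighted average of the local velocity estimates.

```python
import math


def softmax(scores):
    """Normalise scores into positive weights that sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def integrate_velocity(local_velocities, reliability_scores):
    """Selection weights each patch; integration averages the selected motion."""
    weights = softmax(reliability_scores)
    vx = sum(w * v[0] for w, v in zip(weights, local_velocities))
    vy = sum(w * v[1] for w, v in zip(weights, local_velocities))
    return vx, vy


# Three patches that glimpse the bird and five dominated by moving foliage.
velocities = [(4.0, 0.0)] * 3 + [(-1.0, 0.5)] * 5
reliability = [5.0] * 3 + [0.0] * 5

selected = integrate_velocity(velocities, reliability)
unselected = integrate_velocity(velocities, [0.0] * len(velocities))
print(selected)    # close to the bird's velocity
print(unselected)  # plain average, contaminated by the foliage
```

With uniform scores the estimate is a plain average in which the more numerous foliage patches dominate; with selection, the estimate stays close to the velocity of the bird even though the bird occupies fewer patches.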

It would not be surprising if evolution found the interactive strategy
good  for  brains.  So  long  as  the  segmentation  problem  is  partially  solved, 
a  good  answer  can  be  dumped  out  of  the  visual  "pipeline"  very  quickly. 
When,  however,  the  task  is  more  difficult,  iterations  and  feedback  may  be 
essential  to  drumming  up  an  adequate  solution.  To  speed  up  processing 
in  the  difficult  cases — which  will  be  the  rule,  not  the  exception,  in  real- 
world  vision — the  system  may  avail  itself  of  learning.  If,  after  frequent 
encounters,  the  brain  learns  that  certain  patterns  typically  go  together, 
thereafter  the  number  of  iterations  needed  to  find  an  adequate  solution  is 
reduced (Sejnowski 1986). Humans probably "overlearn" letter and word
patterns,  and  hence  seasoned  readers  are  faster  and  more  accurate  than 
novice  readers.  Even  when  text  is  degraded  or  partially  occluded,  a  good 
reader  may  hardly  stumble. 

Movement  (of  Eye,  Head,  Body)  Makes  Many  Visual  Computations 
Simpler  A  number  of  reasons  support  this  point.  First,  the  smooth  pur¬ 
suit  system  for  tracking  slowly  moving  objects  supports  image  stability 
on  the  retina,  simplifying  the  tasks  of  analyzing  and  recognizing.  Second, 
head  movement  during  eye  fixation  yields  cues  useful  in  the  task  of  sep¬ 
arating  figure  from  ground  and  distinguishing  one  object  from  another. 
Motion  parallax  (the  relative  displacement  of  objects  caused  by  change  in 
observer  position)  is  perhaps  the  most  powerful  cue  to  the  relative  depth 
of  objects  (closer  objects  have  greater  relative  motion  than  more  distant 
objects),  and  it  continues  to  be  critical  for  relative  depth  judgments  even 
beyond  about  10  m  from  the  observer,  where  stereopsis  fades  out.  Head 
bobbing  is  common  behavior  in  animals,  and  a  visual  system  that  integrates 
across  several  glimpses  to  estimate  depth  has  computational  savings  over 
one  that  tries  to  calculate  depth  from  a  single  snapshot. 
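
The geometry behind motion parallax is simple enough to work through numerically. The Python sketch below assumes an idealized case that is not taken from the text: the object starts straight ahead, the head translates sideways by a known amount, and the eye does not rotate to compensate. The function names and the example distances are hypothetical.

```python
import math


def parallax_shift_deg(head_shift_m, distance_m):
    """Angular displacement of an object after a sideways head movement."""
    return math.degrees(math.atan2(head_shift_m, distance_m))


def distance_from_shift(head_shift_m, shift_deg):
    """Invert the relation: recover distance from the observed shift."""
    return head_shift_m / math.tan(math.radians(shift_deg))


head_shift = 0.1  # a 10 cm head movement
for distance in (1.0, 10.0, 50.0):
    shift = parallax_shift_deg(head_shift, distance)
    print(distance, round(shift, 3), round(distance_from_shift(head_shift, shift), 3))
```

The nearer the object, the larger its angular shift, so ordering objects by shift orders them by depth. Because the shift falls off only as the inverse of distance, a larger head movement keeps the cue usable at distances where the fixed separation of the two eyes yields very small disparities.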

Another  important  cue  is  optical  flow  (figure  2.15).  When  an  animal  is 
running,  flying,  or  swimming,  for  example,  the  speed  of  an  image  mov¬ 
ing  radially  on  the  retina  is  related  to  the  distance  of  the  object  from  the 
observer.4  This  information  allows  the  system  to  figure  out  how  fast  it  is 
closing  in  on  a  chased  object,  as  well  as  how  fast  a  chasing  object  is  closing 
Figure 2.15 Optical flow represented by a vector field around a flying bird provides information
about self-movement through the environment. (From Gibson 1966)

in. Notice that any of these movements (eyeball, head, and whole body) on
its  own  means  computational  economies.  In  combination,  the  economies 
compound. 
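
One way in which radial image motion bears on closing speed is the standard time-to-contact relation, which is not spelled out in the text: for an observer approaching a point at constant speed, the point's radial image position divided by its radial image speed approximates the time remaining before contact, with no need to know distance or speed separately. The Python sketch below illustrates this under the assumptions of a pinhole projection and straight-line approach; the function names and the numbers are hypothetical.

```python
def radial_image_position(lateral_offset_m, distance_m, focal_length=1.0):
    """Pinhole projection: distance of the image point from the focus of expansion."""
    return focal_length * lateral_offset_m / distance_m


def time_to_contact(r_then, r_now, dt):
    """Image position divided by radial image speed (finite difference)."""
    radial_speed = (r_now - r_then) / dt
    return r_now / radial_speed


# A point 0.5 m off the line of travel, approached at 5 m/s from 10 m away.
speed, dt = 5.0, 0.01
z_then = 10.0
z_now = z_then - speed * dt

r_then = radial_image_position(0.5, z_then)
r_now = radial_image_position(0.5, z_now)

print(time_to_contact(r_then, r_now, dt))  # estimate from the image alone
print(z_now / speed)                       # true value, for comparison
```

The estimate uses only quantities available on the retina, which is one reason optical flow is such an economical cue for a moving animal.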

There are many more examples of how self-generated movement can
provide  solutions  to  otherwise  intractable  problems  in  vision  (Ballard  1991; 
Blake  and  Yuille  1992). 

The  Self-Organization  of  Model  Visual  Systems  during  "Development" 
Is  Enhanced  by  Eye-Position  Signals  An  additional  advantage  of  inter¬ 
active  vision  is  its  role  in  the  construction  of  vision  systems.  Researchers 
in computer vision often reckon, and bemoan, the cost of "hand" building
vision systems, but rarely consider the possibility of growing a visual
system.  Nature,  of  course,  uses  the  growing  strategy,  and  relies  on  ge¬ 
netic  instructions  to  create  neurons  with  the  right  set  of  components.  In 
addition, interactions between neurons, as well as interactions between the
world and neurons, are critical in getting networks of neurons properly
wired  up.  Understanding  the  development  of  the  brain  is  perhaps  as  chal¬ 
lenging  a  problem  as  that  of  understanding  the  function  of  the  brain,  but 
we  are  beginning  to  figure  out  some  of  the  relevant  factors,  such  as  posi¬ 
tion  cues,  timing  of  gene  expression,  and  activity-dependent  modifications. 
Genetic  programming  has  been  explored  as  an  approach  to  solving  some 
construction problems, but the value of development as an intermediary
between  genes  and  phenotype  is  only  beginning  to  be  appreciated  in  the 
computational  community  (Belew  1993). 
Most activity-dependent models of development are based on the Hebb
rule  for  synaptic  plasticity,  according  to  which  the  synapse  strengthens 
when  the  presynaptic  activity  is  correlated  with  the  postsynaptic  activity 
(Sejnowski  and  Tesauro  1989).  Correlation-based  models  for  self-organi¬ 
zation  of  primary  visual  cortex  during  development  have  shown  that  some 
properties  of  cortical  cells,  such  as  ocularity,  orientation,  and  disparity,  can 
emerge from simple Hebbian mechanisms for synaptic plasticity (Swindale
1990; Linsker 1986; Miller and Stryker 1990; Berns et al. 1993). Hebbian
schemes are typically limited in their computational power to finding
the  principal  components  in  the  input  correlations.  It  has  been  difficult 
to  extend  this  approach  to  a  hierarchy  of  increasingly  higher-order  re¬ 
sponse  properties,  as  found  in  the  extrastriate  areas  of  visual  cortex.  One 
new  approach  is  based  on  the  observation  that  development  takes  place 
in  stages.  There  are  critical  periods  during  which  synapses  are  particu¬ 
larly  plastic  (Rauschecker  1991),  and  there  are  major  milestones,  such  as 
eye opening, that change the nature of the input correlations (Berns et al.
1993).
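
The claim that a simple Hebbian neuron is limited to extracting the principal component of its input correlations can be illustrated with a minimal sketch. It assumes Oja's normalized variant of the Hebb rule, which is one standard choice and not necessarily the rule used in the models cited above; the function names, the learning rate, and the toy two-dimensional input are hypothetical.

```python
import math
import random


def oja_first_component(samples, lr=0.01, epochs=50, seed=0):
    """Train one linear neuron with Oja's normalized Hebbian rule (an assumed variant)."""
    rng = random.Random(seed)
    dim = len(samples[0])
    w = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    for _ in range(epochs):
        for x in samples:
            # Postsynaptic activity is the weighted sum of presynaptic inputs.
            y = sum(wi * xi for wi, xi in zip(w, x))
            # Hebbian growth (y * x) with a decay term (y * y * w) that bounds |w|.
            w = [wi + lr * y * (xi - y * wi) for wi, xi in zip(w, x)]
    return w


def make_correlated_inputs(n=500, seed=1):
    """Toy zero-mean 2-D inputs whose variance is largest along the (1, 1) diagonal."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        a = rng.gauss(0.0, 1.0)  # strong component shared by both inputs
        b = rng.gauss(0.0, 0.2)  # weak component of opposite sign in each input
        data.append((a + b, a - b))
    return data


samples = make_correlated_inputs()
w = oja_first_component(samples)
norm = math.sqrt(sum(wi * wi for wi in w))
# Absolute cosine between the learned weights and the (1, 1) direction.
alignment = abs(w[0] + w[1]) / (norm * math.sqrt(2))
```

In this toy setting the weight vector settles along the direction of greatest input variance, so `alignment` approaches 1. Nothing in the rule lets the neuron pick out structure beyond these second-order correlations, which is the limitation at issue.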

Nature  exploits  additional  mechanisms  in  the  developing  brain  to  help 
organize  visual  pathways.  One  important  class  of  mechanisms  is  based 
on  the  interaction  between  self-generated  actions  and  perception,  along 
the  lines  already  discussed  in  the  previous  section.  Eye  movement  in¬ 
formation  in  the  visual  cortex  during  development,  when  combined  with 
Hebbian  plasticity,  may  be  capable  of  extracting  higher-order  correlations 
from  complex  visual  inputs.  The  correlation  between  eye  movements  and 
changes  in  the  image  contains  information  about  important  visual  proper¬ 
ties.  For  example,  correlation  of  saccadic  eye  movements  with  the  response 
of  a  neuron  can  be  used  in  a  Hebbian  framework  to  develop  neurons  that 
respond  to  the  direction  of  motion.  At  still  higher  levels  of  processing, 
eye  movement  signals  that  direct  saccades  to  salient  objects  can  be  used 
as  a  reward  signal  to  build  up  representations  of  significant  objects  (Mon¬ 
tague  et  al.  1993).  This  new  view  provides  eye-movement  signals  with  an 
important  function  in  visual  cortex  both  during  development  and  in  the 
adult. 

The  plasticity  of  the  visual  cortex  during  the  critical  period  is  modulated 
by  inputs  from  subcortical  structures  that  project  diffusely  throughout  the 
cortex  (Rauschecker  1991).  The  neurotransmitters  used  by  these  systems 
often  diffuse  from  the  release  sites  and  act  at  receptors  on  neurons  some 
distance  away.  These  diffuse  ascending  systems  to  the  cerebral  cortex  that 
are  used  during  development  to  help  wire  up  the  brain  are  also  used  in 
the  adult  for  signaling  reward  and  salience.  The  information  carried  by 
neurons  in  these  systems  is  rather  limited:  there  are  relatively  few  neurons 
compared  to  the  number  in  the  cortex,  they  have  a  low  basal  firing  rate, 
and  changes  in  their  firing  rates  occur  slowly.  This  is,  however,  just  the 
sort  of  information  that  could  be  used  to  organize  and  regulate  information 
storage  throughout  the  brain,  as  shown  below. 
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Interactive  Perception  Simplifies  the  Learning  Problem  A  difficulty  fac¬ 
ing  conventional  reinforcement  learning  is  this:  Assuming  the  brain  creates 
and  maintains  a  picture-perfect  visual  scene  at  each  moment,  how  does  the 
brain  determine  which,  among  the  many  features  and  objects  it  recently 
perceived,  are  the  ones  relevant  to  the  reward  or  punishment?  An  expe¬ 
rienced  animal  will  have  a  pretty  good  idea,  but  how  does  its  experience 
get  it  to  that  stage?  How  does  the  naive  brain  determine  which  "stimulus" 
in  the  richly  detailed  stimulus  array  gets  the  main  credit  when  a  certain 
response  brings  a  reward?  How  does  the  brain  know  what  synapses  to 
strengthen? 

This "relevance problem" becomes even more vexing as the time delay
between the stimulus and the reinforcement increases. For then the
stimulus array develops over time, getting richer and richer as time passes.
The  correlative  problem  of  knowing  which  movement  among  many  move¬ 
ments  made  was  the  relevant  one  is  likewise  increasingly  difficult  as  the 
delay  increases  between  the  onset  of  various  movements  and  the  reward¬ 
ing  or  punishing  outcome.5  These  questions  involve  considerations  that  go 
well  beyond  the  visual  system,  and  include  parts  of  the  brain  that  evaluate 
sensory  inputs. 

Suppose  that  evolution  has  wired  the  brain  to  bias  attention  as  a  func¬ 
tion  of  how  the  species  makes  its  living,  and  that  the  neonate  is  tuned  to 
attend  to  some  basic  survival-relevant  properties.  The  evolutionary  point 
legitimates  the  assumption  that  an  attended  feature  of  the  stimulus  scene 
is  more  likely  to  be  causally  implicated  in  producing  the  conditions  for  the 
reward,  and  assuming  that  the  items  in  iconic  and  working  memory  are,  by 
and large, items previously attended to, then the number of candidate
representations to canvass as "relevant" is far smaller than those embellishing
a  rich-replica  visual  world  representation.  Granted  all  these  assumptions, 
the  credit  assignment  problem  is  far  more  manageable  here  than  in  the 
pure  vision  theoretical  framework  (Ballard  1991). 

By  narrowing  the  number  of  visuomotor  trajectories  that  count  as  salient, 
attention can bias the choice of which synapses to strengthen. Selective
strengthening of synapses of certain visuomotor representations "spotlit" by attention
is  a  kind  of  hypothesis  the  network  makes.  It  is,  moreover,  an  hypothesis 
the  network  tests  by  repeating  the  visuomotor  trajectory.  Initially  the  net¬ 
work  will  shift  attention  more  or  less  randomly,  save  for  guidance  from 
startle  responses  and  other  reflexive  behavior.  Given  that  attention  down¬ 
sizes  the  options,  and  that  the  organism  can  repeatedly  explore  the  various 
options,  the  system  learns  to  direct  attention  to  visual  targets  that  it  has 
learned  are  "good  bets"  in  the  survival  game.  This,  in  turn,  contributes 
to  further  simplifying  the  learning  problem  in  the  future,  for  on  the  next 
encounter,  attention  will  more  likely  be  paid  to  relevant  features  than  to 
irrelevant features, and the connections can be up-regulated or down-regulated
as a result of reward or lack of same (the above points are from
Whitehead and Ballard 1990, 1991; see also Grossberg 1987).
LEARNING TO SEE


A  robust  property  of  animal  learning  is  that  responses  reinforced  by  a  re¬ 
ward  are  likely  to  be  produced  again  when  relevantly  similar  conditions 
obtain.  This  is  the  starting  point  for  behavioral  studies  of  operant  con¬ 
ditioning  in  psychology  (Rescorla  and  Wagner  1972;  Mackintosh  1974), 
neuroscientific  inquiry  into  the  reward  systems  of  the  brain  (Wise  1982) 
and  engineering  exploration  of  the  principles  and  applications  of  reinforce¬ 
ment  learning  theory  (Sutton  and  Barto  1981, 1987).  Both  neuroscience  and 
computer  engineering  draw  on  the  vast  and  informative  psychological  lit¬ 
erature  describing  the  various  aspects  of  reinforcement  learning,  including 
such  phenomena  as  blocking,  extinction,  intermittent  versus  constant  re¬ 
ward,  cue  ranking,  how  time  is  linked  to  other  cues,  and  so  forth.  The 
overarching  aim  is  that  the  three  domains  of  experimentation  will  link  up 
and  yield  a  unified  account  of  the  scope  and  limits  of  the  capacity  and  of 
its  underlying  mechanisms  (see  Whitehead  and  Ballard  1990;  Montague  et 
al.  1994). 

Detailed  observations  of  animal  foraging  patterns  under  well  quanti¬ 
fied  conditions  indicate  that  animals  can  display  remarkably  sophisticated 
adaptive  behavior.  For  example,  birds  and  bees  quickly  adopt  the  most  ef¬ 
ficient  foraging  pattern  in  "two-armed  bandit"  conditions  (a:  high-payoff 
when  a  "hit"  and  "hits"  are  infrequent;  b:  low  payoff  when  a  "hit"  and 
"hits"  are  frequent)  (see  Krebs  et  al.  1978;  Gould  1984;  Real  et  al.  1990;  Real 
1991).  Cliff-dwelling  rooks  learn  to  bombard  nest-marauders  with  pebbles 
(Griffin  1984).  A  bear  learns  that  a  bluff  of  leafy  trees  in  a  hill  otherwise 
treed  with  pines  means  a  gully  with  a  creek,  and  a  creek  means  rocks  under 
which  crawfish  are  often  living,  and  that  means  tasty  dinner. 

The  questions  posed  in  the  previous  section  concerning  reinforcement 
learning,  along  with  the  dearth  of  obvious  answers,  have  moved  some 
cognitive  psychologists  (e.g.,  Chomsky  1965,  1980;  Fodor  1981)  to  con¬ 
clude  that  reinforcement  learning  cannot  be  a  serious  contender  for  the 
sophisticated  learning  typical  of  cognitive  organisms.  Further  skepticism 
concerning  reinforcement  learning  as  a  cognitive  contender  derives  from 
neural net modeling. Here the results show that neural nets trained up
by  available  reinforcement  procedures  scale  poorly  with  the  number  of  di¬ 
mensions  of  the  input  space.  In  other  words,  as  a  net's  visual  representation 
approximates  a  rich  replica  of  the  real  world,  the  training  phase  becomes 
unrealistically  long.  Consequently,  computer  engineers  often  conclude 
that  reinforcement  learning  is  impractical  for  most  complex  task  domains. 
According  to  some  cognitive  approaches  (Fodor,  Pylyshyn,  etc.),  suitable 
learning  theories  must  be  "essentially  cognitive,"  meaning,  roughly,  that 
cognitive  learning  consists  of  logic-like  transformations  over  language¬ 
like  representations.  Moreover,  the  theory  continues,  such  learning  is  irre¬ 
ducible  to  neurobiology.  (For  a  fuller  characterization  and  criticism  of  this 
view,  see  Churchland  1986.) 

By  contrast,  our  hunch  is  that  much  cognitive  learning  may  well  turn  out 
to be explainable as reinforcement learning once the encompassing details
of the rich-replica assumption no longer inflate the actual magnitude of
the "relevance problem."

Natural  selection  and  reinforcement  learning  share  a  certain  scientific  ap¬ 
peal;  to  wit,  neither  presupposes  an  intelligent  homunculus,  an  omniscient 
designer,  or  a  miraculous  force — both  are  naturalistic,  as  opposed  to  super- 
naturalistic.  They  also  share  reductionist  agendas.  Thus,  as  a  macrolevel 
phenomenon,  reinforcement  learning  behavior  is  potentially  explainable  in 
terms  of  micromechanisms  at  the  neuronal  level.  And  we  are  encouraged  to 
think  so  because  the  Hebbian  approach  to  mechanisms  for  synaptic  modifi¬ 
cation  underlying  reinforcement  learning  looks  very  plausible.  These  gen¬ 
eral  considerations,  in  the  context  of  the  data  discussed  earlier,  suggest  that 
the  skepticism  concerning  the  limits  of  reinforcement  learning  should  re¬ 
ally  be  relocated  to  the  background  assumption — the  rich-replica  assump¬ 
tion.  Consequently,  the  question  guiding  the  following  discussion  is  this: 
What  simplifications  in  the  learning  problem  can  be  achieved  by  abandon¬ 
ing  Pure  Vision's  rich-replica  assumption?  How  much  mileage  can  we  get 
out  of  the  reinforcement  learning  paradigm  if  we  embrace  the  assumption 
that  the  perceptual  representations  are  semiworld  representations  consist¬ 
ing  of,  let  us  say,  goal-relevant  properties?  How  might  that  work? 

Using  an  internal  evaluation  system,  the  brain  can  create  predictive  se¬ 
quences  by  rewarding  behavior  that  leads  to  conditions  that  in  turn  permit 
a  further  response  that  will  produce  an  external  reward,  that  is,  sequences 
where  one  feature  is  a  cue  for  some  other  event,  which  in  turn  is  a  cue 
for  a  further  event,  which  is  itself  a  cue  for  a  reward.  To  get  the  fla¬ 
vor,  suppose,  for  example,  a  bear  cub  chances  on  crawfish  under  rocks 
in a creek, whereupon the crawfish/rocks-in-water relationship will be
strengthened. Looking under rocks in a lake produces no crawfish, so
the crawfish/rocks-in-lake relationship does not get strengthened, but the
crawfish/rocks-in-creek relationship does. Finding a creek in a leafy-tree
gully  allows  internal  diffusely-projecting  modulatory  systems,  such  as  the 
dopamine system, to then reward associations between creeks and
leafy-trees-in-gullies, even in the absence of an external reward between creeks
and  leafy-trees-in-gullies. 
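
The way an internal evaluation signal can pass credit back along a chain of cues may be sketched as follows. This is an illustrative toy, assuming a standard temporal-difference (TD(0)) update rather than any specific mechanism proposed in the text; the cue names, learning rate, and discount factor are hypothetical. External reward arrives only at the last step, yet earlier cues acquire predictive value because each is "rewarded" internally by the prediction attached to the cue that follows it.

```python
def learn_cue_chain(cues, reward=1.0, lr=0.2, discount=0.9, episodes=200):
    """TD(0) value learning along a fixed sequence of cues ending in reward."""
    value = {cue: 0.0 for cue in cues}
    for _ in range(episodes):
        for i, cue in enumerate(cues):
            last = i == len(cues) - 1
            external = reward if last else 0.0  # food arrives only at the end
            next_value = 0.0 if last else value[cues[i + 1]]
            # Internal teaching signal: what happened next (external reward plus
            # the discounted next prediction) minus what this cue predicted.
            delta = external + discount * next_value - value[cue]
            value[cue] += lr * delta
    return value


# Hypothetical cue sequence for the bear cub example.
chain = ["leafy_gully", "creek", "rocks_in_creek"]
values = learn_cue_chain(chain)
```

With these settings the learned values fall off with distance from the reward (roughly 1.0 for rocks in the creek, 0.9 for the creek, and 0.81 for the leafy gully), so a cue two steps removed from the crawfish still becomes worth attending to.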

Given  such  an  internal  reward  system,  the  brain  can  build  a  network 
replete  with  predictive  representations  that  inform  attention  as  to  what  is 
worth  looking  at  given  one's  interests  ("that  big  dead  tall  tree  will  proba¬ 
bly  have  hollows  in  it,  and  there  will  probably  be  a  blue-bird's  nest  in  one 
cavity,  and  that  nest  might  have  eggs  and  I  will  get  eggs  to  eat").  To  a  first 
approximation,  a  given  kind  of  animal  comes  to  have  an  internal  model 
of  its  world;  that  is,  of  its  relevant-to-my-life-style  world,  as  opposed  to  a 
world-with-all-its-perceptual-properties.  For  bears,  this  means  attending 
to creeks and dead trees when foraging, and not noticing much if anything
about rocks at lake edges, or sunflowers in a meadow. All of which
then  makes  subsequent  reinforcement  tasks  and  the  delimiting  of  what  is 
relevant  that  much  easier.  (To  echo  the  school  marm's  saw,  the  more  you 
know about the world, the better the questions you can ask of it and the
faster  you  learn.) 

A  neural  network  model  of  predictive  reinforcement  learning  in  the  brain 
roughly  based  on  a  diffuse  neurotransmitter  system  has  been  applied  to 
the  adaptive  behavior  of  foraging  bumble  bees  (Montague  et  al.  1994). 
This  is  an  especially  promising  place  to  test  the  semiworld  hypothesis,  for 
it  is  an  example  in  which  both  the  sensory  input  and  the  motor  output  can 
be quantified, the animal gets quantifiable feedback (sugar reward), and
something  of  the  physiology  of  the  reward  system,  the  motor  system,  and 
the  visual  system  in  the  animal's  brain  has  been  explored.  Furthermore,  bee 
foraging  behavior  has  been  carefully  studied  by  several  different  research 
groups,  and  there  are  lots  of  data  available  to  constrain  a  network  model. 

Bees  decide  which  flowers  to  visit  according  to  past  success  at  gathering 
nectar,  where  nectar  volume  varies  stochastically  from  flower  to  flower 
(Real  1991).  The  cognitive  characterization  of  the  bees'  accomplishments 
involves  applications  of  computational  rules  over  representations  of  the 
arithmetic  mean  of  rewards  and  variance  in  reward  distributions.  On  the 
other  hand,  according  to  the  Dayan-Montague  reinforcement  hypothesis 
(Montague  et  al.  1994),  when  a  bee  lands  on  a  flower,  the  actual  reward 
value  of  the  nectar  collected  by  the  bee  is  compared  (more?  or  less?  or 
right  on?)  with  the  reward  that  its  brain  had  predicted,  and  the  differ¬ 
ence  is  used  to  improve  the  prediction  of  future  reward  using  predictive 
Hebbian  synapses.  Dayan  and  Montague  propose  that  the  very  same  pre¬ 
dictive  network  is  used  to  bias  the  actions  of  the  bee  in  choosing  flowers. 
Using  this  nonhomuncular,  nondivine,  naturalistic  learning  procedure,  the 
model  network  accurately  mimics  the  foraging  behavior  of  real  bumble¬ 
bees  (figure  2.16). 
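
The core of the prediction-error idea can be sketched in a few lines. This is a deliberately reduced illustration and not the Montague et al. network: it assumes one prediction weight per flower colour, a single-step prediction error (reward received minus reward predicted), and a softmax choice rule standing in for the action selection system. The payoff schedules, the `gain` parameter, and all function names are hypothetical.

```python
import math
import random


def forage(payoffs, visits=2000, lr=0.1, gain=5.0, seed=0):
    """Simulate flower choices biased by learned reward predictions.

    payoffs maps a flower colour to a function rng -> nectar reward.
    """
    rng = random.Random(seed)
    colours = sorted(payoffs)
    weights = {c: 0.0 for c in colours}  # predicted reward per colour
    counts = {c: 0 for c in colours}
    for _ in range(visits):
        # The same predictions that are being learned also bias the choice.
        prefs = [math.exp(gain * weights[c]) for c in colours]
        colour = rng.choices(colours, weights=prefs)[0]
        reward = payoffs[colour](rng)
        delta = reward - weights[colour]  # prediction error: more, less, or right on
        weights[colour] += lr * delta     # improve the prediction for this colour
        counts[colour] += 1
    return weights, counts


# Hypothetical payoff schedules: blue is reliable, yellow pays off rarely.
payoffs = {
    "blue": lambda rng: 0.6,
    "yellow": lambda rng: 1.0 if rng.random() < 0.3 else 0.0,
}
weights, counts = forage(payoffs)
```

Here the weight for blue converges on its constant payoff, the weight for yellow hovers around its average payoff, and visits shift toward blue, all without any explicit representation of means or variances.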

That  such  a  simple,  "dumb"  organization  can  account  for  the  appar¬ 
ent  statistical  cunning  of  bumblebees  is  encouraging,  for  it  rewards  the 
hunch  that  much  more  can  be  got  out  of  a  reinforcement  learning  paradigm 
once  the  "pure  vision"  assumption  is  replaced  by  the  "interactive-vision- 
cum-predictive-learning"  assumption.  As  we  contemplate  extending  the 
paradigm  from  bees  to  primates,  it  is  also  encouraging  that  similar  diffuse 
neurotransmitter  systems  are  found  in  primates  where  there  is  evidence 
that some of them are involved in predicting rewards (Ljungberg et al. 1992).

Bees  successfully  forage,  orient,  fly,  communicate,  houseclean  and  so 
forth, and do it all with fewer than 10^6 neurons (Sejnowski and Churchland
1992). Human brains, by contrast, are thought to have upward of 10^12
neurons.  Although  an  impressively  long  evolutionary  distance  stretches 
between  insects  and  mammals,  what  remains  constant  is  the  survival  value 
of  learning  cues  for  food,  cues  for  predators  and  so  forth.  Consequently, 
conservation  of  the  diffuse,  modulatory,  internal  reward  system  makes 
good  biological  sense.  What  is  sensitive  to  the  pressure  of  natural  selec¬ 
tion  is  additional  processors  that  permit  increasingly  subtle,  fine-grained, 
and  long-range  predictions — always,  of  course,  relevant-to-my-thriving 
predictions.  That  in  turn  may  entail  making  better  and  better  classifications 
Figure 2.16 Neural architecture for a model of bee foraging. Predictions about future
expected reinforcement are made in the brain using a diffuse neurotransmitter system. Sensory
input drives the units B and Y representing blue and yellow flowers. These units project
to a reinforcement neuron P through a set of plastic weights (filled circles w_B and w_Y) and
to an action selection system. S provides input to R and fires while the bee sips the nectar.
R projects its output r_t through a fixed weight to P. The plastic weights onto P implement
predictions about future reward and P's output is sensitive to temporal changes in its input.
The outputs of P influence learning and also the selection of actions such as steering in flight
and landing. Lateral inhibition (dark circle) in the action selection layer performs a
winner-takes-all. Before encountering a flower and its nectar, the output of P will reflect the temporal
difference only between the sensory inputs B and Y. During an encounter with a flower and
nectar, the prediction error δ_t is determined by the output of B or Y and R, and learning occurs
at connections w_B and w_Y. These strengths are modified according to the correlation between
presynaptic activity and the prediction error δ_t produced by neuron P. Simulations of this
model account for a wide range of observations of bee preference, including aversion for
risk. (From Montague et al. 1994)

(relative  to  the  animals'  lifestyle),  as  well  as  more  efficient  and  predictively 
sound  generalizations  (relative  to  the  animals'  life-style). 

To  a  first  approximation,  cortical  enlargement  was  driven  by  the  com¬ 
petitive  advantage  accruing  to  brains  with  fancier,  good-for-me-and-my- 
kin  predictive  prowess,  where  the  structures  performing  those  functions 
would  have  to  be  knit  into  the  reward  system.  Some  brand  new  represen¬ 
tational  mechanisms  may  also  have  been  added,  but  the  increased  "intel¬ 
ligence"  commonly  associated  with  increased  size  of  the  cortical  mantle 
may be a function chiefly of greater predictive-goal-relevant representational
power, not of greater representational power per se. Whether some
property  of  the  world  is  visually  represented  depends  on  the  represen¬ 
tation's  utility  in  the  predictive  game,  and  for  this  to  work,  the  cortical 
representational  structures  must  be  plastic  and  must  be  robustly  tethered 
to  the  diffusely  projecting  systems.  World-perfect  replicas,  unhitched  from 
the basic engines of reward and punishment, are probably more of a liability
than an advantage; they are likely to be time-wasters, space-wasters,
and energy-wasters.

On  this  approach,  various  contextual  aspects  of  visual  perception,  such 
as  filling  in,  seeing  the  dot  move  behind  the  occluder,  cross-modal  effects, 
and  plasticity  in  exotropia,  can  be  understood  as  displaying  the  predictive 
character  of  cortical  processing. 

CONCLUDING  REMARKS 

A well-developed geocentric astronomy was probably an inevitable forerunner
to modern astronomy. One has to start with what seems most secure
and build from there. The apparent motionlessness of the earth, the fixity of
the  stars,  and  the  retrograde  motions  of  the  planets  were  the  accessible 
and  seemingly  secure  "observations"  that  grounded  theorizing  about  the 
nature  of  the  heavens.  Such  were  the  first  things  one  saw — saw  as  system¬ 
atically  and  plainly  as  one  saw  anything.  The  geocentric  hypothesis  also 
provided  a  framework  for  the  very  observations  that  eventually  caused  it 
to  be  overhauled. 

In  something  like  the  same  way,  the  Theory  of  Pure  Vision  is  probably 
essential  to  understanding  how  we  see,  even  if,  as  it  seems,  it  is  a  ladder 
we  must  eventually  kick  out  from  under  us.  The  accessible  connectivity 
suggests  a  hierarchy,  the  most  accessible  and  salient  temporal  sequence 
is  sensory  input  to  the  transducers  followed  by  output  from  the  muscles, 
the  accessible  response  properties  of  single  cells  show  simple  specificities 
nearer  the  periphery  and  greater  complexity  the  further  from  the  periphery, 
and  so  on.  Such  are  the  grounding  observations  for  a  hierarchical,  modular, 
input-output  theory  of  how  we  see. 

But  there  are  nagging  observations  suggesting  that  the  brain  is  only 
grossly  and  approximately  hierarchical,  that  input  signals  from  the  sensory 
periphery are only a part of what drives "sensory" neurons, that ostensibly
later processing can influence earlier processes, that motor business
can  influence  sensory  business,  that  processing  stages  are  not  much  like 
assembly  line  productions,  that  connectivity  is  nontrivially  back  as  well 
as  forward,  etc.  Some  phenomena,  marginalized  within  the  Pure  Vision 
framework,  may  be  accorded  an  important  function  in  the  context  of  a  het¬ 
erarchical,  interactive,  space-critical  and  time-critical  theory  of  how  we  see. 
Consider,  for  example,  spontaneous  activity  of  neurons,  so-called  "noise" 
in  neuronal  activity,  nonclassical  receptive  field  properties,  visual  system 
learning,  attentional  bottlenecks,  plasticity  of  receptive  field  properties, 
time-dependent  properties,  and  backprojections. 

Obviously  visual  systems  evolved  not  for  the  achievement  of  sophisti¬ 
cated  visual  perception  as  an  end  in  itself,  but  because  visual  perception 
can  serve  motor  control,  and  motor  control  can  serve  vision  to  better  serve 
motor  control,  and  so  on.  What  evolution  "cares  about"  is  who  survives, 
and that means, basically, who excels in the four Fs: feeding, fleeing, fighting,
and reproducing. How to exploit that evolutionary truism to develop
a theoretical framework that is, as it were, "motocentric" rather than
"visuocentric" we only dimly perceive (see also Powers 1973; Bullock et al.
1977; Llinas 1987; Llinas 1991; Churchland 1986). In any event, it may be
worth  trying  to  rethink  and  reinterpret  many  physiological  and  anatomical 
results  under  the  auspices  of  the  idea  that  perception  is  driven  by  the  need 
to  learn  action  sequences  to  be  performed  in  space  and  time. 
NOTES

1.  With  apologies  to  Immanuel  Kant. 

2. For further research along these lines see, for example, Ullman and Richards (1984), Poggio
et  al.  (1985),  and  Horn  (1986).  For  a  sample  of  current  research  squarely  within  this  tradition, 
see,  for  example,  a  recent  issue  of  Pattern  Analysis  and  Machine  Intelligence. 

3.  Poggio  et  al.  (1985)  say:  "[Early  vision]  processes  represent  conceptually  independent 
modules  that  can  be  studied,  to  a  first  approximation,  in  isolation.  Information  from  the 
different  processes,  however,  has  to  be  combined.  Furthermore,  different  modules  may 
interact  early  on.  Finally,  the  processing  cannot  be  purely  "bottom-up":  specific  knowledge 
may  trickle  down  to  the  point  of  influencing  some  of  the  very  first  steps  in  visual  information 
processing"  (p.  314).  Although  we  agree  that  this  is  a  step  in  the  right  direction,  we  shall 
argue  that  "trickle"  does  not  begin  to  do  justice  to  the  cascades  of  interactivity. 

4.  These  brief  comments  give  no  hint  of  the  complexities  of  optic  flow  cues  and  their  analysis. 
For  discussion,  see  Cutting  (1986). 

5.  In  the  case  of  food-aversion  learning  the  delay  between  ingested  food  and  nausea  may  be 
many  hours. 
INTRODUCTION

In  this  chapter  we  outline  some  ideas  aimed  at  understanding  the  neural 
processes  behind  knowledge  retrieval  in  primates.  The  proposal  is  neither 
a  model  nor  a  theory.  It  is  a  framework.  It  is  large-scale,  in  both  cognitive 
and neural terms, by which we mean that it deals with psychologically
meaningful information, on the one hand, and with neural systems made
up of macroscopic units (e.g., cortical regions, nuclei, etc.) and their
connectional patterns as studied by current experimental neuroanatomy, on the other.

We  also  want  to  make  clear  that,  by  knowledge,  we  mean  records  of 
interactions  between  the  brain,  on  the  one  hand,  and  the  entities  and  events 
external  to  it,  on  the  other.  These  entities  and  events  exist  both  outside 
the  organism  and  inside  the  organism,  in  its  body  proper.  Our  focus  in 
this  chapter  is  on  concrete  entities  (and  their  properties)  external  to  the 
organism. 

An  additional  qualification  is  in  order.  Although  the  entities  and  events 
we  speak  about  are  real,  our  framework  does  not  require  assuming  that 
they  are  necessarily  as  we  construct  them  with  our  neural  machineries. 
On  the  contrary,  we  conceptualize  them  as  neurobiological  fabrications, 
shaped  by  the  organism's  dispositions,  especially  those  that  pertain  to 
innate  neural  circuitries  in  charge  of  biological  regulation  for  survival. 

The  framework  concentrates  on  cerebral  cortex  and  has  been  developed 
from  a  background  of  neuropsychologic  studies  in  humans  with  lesions  in 
the  telencephalon.  For  that  reason  we  will  begin  by  reviewing  pertinent 
evidence. 

BACKGROUND 

The  findings  summarized  below  are  based  on  both  recognition  and  recall 
paradigms  and  on  the  use  of  nonverbal  as  well  as  verbal  stimuli.  In  gen¬ 
eral,  they  show  that  individuals  with  lesions  in  association  cortices  within 
the  visual,  auditory,  and  somatosensory  regions,  and  within  "high-order" 
temporal cortices, can no longer effectively and reliably conjure up knowledge
about some conceptual categories or about unique entities within
certain categories. Those individuals have an impaired ability to generate
the internal representations on which concept evocation must rely. In
other  words,  they  can  no  longer  generate  the  ephemeral  displays  of  sen¬ 
sory  information  that,  when  enhanced  by  attention,  would  have  become 
conscious  and  summed  up  knowledge  pertaining  to  a  concept.  At  the  same 
time  those  individuals  do  not  show  defective  attention,  that  is,  their  ability 
to  focus  on  mental  contents  and  bring  them  into  clear  consciousness  is  not 
altered.  This  is  important  to  note  since  attention  is  necessary  for  retrieval  of 
knowledge,  and  since  it  is  known  that  patients  with  lesions  in  other  systems 
(e.g.,  in  parietal  cortices)  can  perform  deficiently  on  knowledge-retrieval 
tasks  simply  because  of  attention  deficits. 

The  key  findings  are  as  follows:  First,  patients  with  bilateral  damage 
to  the  hippocampus  proper  are  unable  to  learn  factual  knowledge.  Ex¬ 
amples  of  factual  knowledge  (also  known  as  declarative  knowledge)  in¬ 
clude  new  entities  and  events,  both  as  members  of  a  category  and  as 
unique exemplars. These patients, however, are able to learn perceptuomotor
skills whose retrieval does not require the generation of a conscious
internal representation (so-called procedural knowledge). Examples
of these skills include the learning of rotor pursuit, mirror tracing,
and  mirror  reading,  all  of  which  require  the  gradual  mastering  of  a  task, 
over  several  sessions,  along  a  learning  curve.  They  are  also  able  to  re¬ 
trieve  previously  acquired  factual  knowledge  about  varied  entities  and 
events,  both  as  members  of  a  category  and  as  unique  exemplars.  Their 
perception,  language,  and  motor  control  are  normal  (Milner  et  al.  1968; 
Corkin  1984;  Zola-Morgan  et  al.  1986;  Gabrieli  et  al.  1988).  Patients  with 
Alzheimer's  disease,  in  whom  cell-specific  and  laminar-specific  damage 
compromises  hippocampal  circuitry  (bilaterally)  and  extensive  sectors  of 
high-order  association  cortices,  are  similar  to  patients  with  damage  re¬ 
stricted  to  hippocampus  in  that  they  show  defective  learning  of  factual 
knowledge  and  normal  learning  of  perceptuomotor  skills  (Eslinger  and 
Damasio  1986;  Van  Hoesen  and  Damasio  1987).  They  differ  from  patients 
with  restricted  hippocampal  damage,  however,  in  that  their  retrieval  of  pre¬ 
viously  acquired  factual  knowledge  is  defective,  particularly  as  the  disease 
progresses. 

Second,  the  results  of  bilateral  damage  to  both  the  medial  temporal 
region  that  contains  hippocampus,  entorhinal  cortex,  the  remainder  of 
parahippocampal  gyrus,  amygdala,  and  perirhinal  cortex,  as  well  as  the 
nonmedial temporal cortices are remarkably different from those of hippocampal
damage alone. The nonmedial temporal region includes area 38
(temporal pole) and areas 21, 20, 36, and part of 37 (the human inferotemporal
region; see figure 3.1).

The patient known as Boswell is exemplary (Damasio et al. 1989). He has
a  severe  impairment  in  the  retrieval  of  previously  acquired  factual  knowl¬ 
edge,  which  affects  all  unique  entities  and  events.  He  cannot  narrate  any 
specific  episode  from  the  several  decades  of  autobiography  that  preceded 
the onset of his lesion. He is unable to recognize family or friends (from
face or voice), unique places or objects (from sight or sound), although he
can recognize them at categorical level (as faces, as houses, or cars).

Figure 3.1 Lateral (top) and mesial (bottom) surfaces of the human brain with the main gyri
and cytoarchitectonic areas according to Brodmann. The images were obtained from high
resolution MR cuts reconstructed in 3-D by BRAINVOX (Damasio and Frank 1992).

He  also  shows  a  remarkable  dissociation  for  nonunique  entities.  He  can 
always  assign  any  entity  to  its  "supraordinate"  taxonomic  category.  He 
can recognize different utensils or animals as being a utensil or an animal;
in  other  words,  he  can  indicate  the  most  general  category  to  which  an 
entity belongs. On the other hand, he can provide the "middle level"
categorization, or the so-called "basic object" level, only for certain types of
entities. As an example, he can provide "basic object" level recognition for
houses and tools (ranch-type house, wrench), but most natural entities baffle
him.  Confronted  with  the  picture  of  a  camel  or  a  zebra  he  will  appropriately 
say  it  is  an  animal  but  not  go  beyond  that  supraordinate  assignment.  In 
brief,  when  shown  the  faces  of  unique  persons  he  was  previously  familiar 
with (e.g., Roosevelt or his wife), he is unable to recognize them, but
he  knows  that  their  faces  are  human  faces.  He  also  knows  the  meaning 
of  the  basic  facial  expressions  shown  in  those  faces  (in  stills  or  in  motion) 
even if those expressions are instantiated in faces whose identity he no
longer  retrieves.  He  has  perfect  knowledge  of  the  components  of  those 
faces  or  other  objects  (shapes,  colors,  movements).  When  asked  to  think 
about  specific  faces  (or  places),  he  cannot  conjure  up  the  face  of  any  one 
particular  person  (e.g.,  he  cannot  conjure  up  his  wife's  face).  Yet  he  can 
generate  an  internal  representation  of  a  "generic"  human  face,  or  of  part  of 
a  face  (e.g.,  a  nose),  or  of  a  geometric  figure  (a  square),  or  of  any  color.  He 
has  no  perceptual  defect  (olfaction  excepted),  no  motor  impairment,  and 
he  can  attend  effectively  to  all  stimuli  he  is  shown,  including  the  stimuli 
that  he  cannot  recognize. 

The  evidence  discussed  so  far  suggests  that  (1)  nonmedial  temporal 
cortices  (polar,  inferotemporal,  posterior  parahippocampal)  are  essential 
for  the  retrieval  of  previously  acquired  factual  knowledge,  especially  for 
unique  exemplars,  none  of  which  can  be  retrieved  after  substantial  bilat¬ 
eral  damage  to  this  sector,  and  (2)  neither  nonmedial  nor  medial  temporal 
cortices  play  a  role  in  the  acquisition  or  retrieval  of  skill  knowledge,  basic 
visual  perception,  or  motor  control. 

Third,  patients  with  bilateral  damage  in  sectors  of  the  inferior  occipital 
and  posterior  temporal  visual  association  cortices  (areas  18,  19  and  pos¬ 
terior  37  in  figure  3.1)  cannot  conjure  up  the  unique  knowledge  pertinent 
to  a  unique  object  when  the  object  is  presented  visually  (i.e.,  they  cannot 
recognize a unique identity) but can see the object and provide descriptions
of its visual details. They also cannot generate an internal representation of
a unique visual entity when the stimulus is not present (e.g., they are unable
to conjure a specific face in the absence of the model). Unlike patient
Boswell,  however,  these  patients  can  retrieve  unique  knowledge  relative 
to  a  unique  entity  when  the  stimulus  is  presented  through  a  nonvisual 
channel.  The  unique  voice  of  a  specific  person  allows  the  patient  to  know 
the  unique  identity  behind  it  (Damasio  et  al.  1990b). 

The  evidence  thus  indicates  that  damage  placed  anteriorly  in  the  occipi¬ 
totemporal  system  precludes  retrieval  of  any  unique  knowledge,  regardless 
of  the  sensory  channel  used  to  trigger  the  retrieval,  and  that  posterior  dam¬ 
age  in  specific  sectors  of  the  visual  system  precludes  retrieval  of  unique 
knowledge  only  when  the  stimulus  is  visual.  The  same  kind  of  damage 
impairs  visual  imagery  but  not  visual  perception. 

Fourth,  patients  with  bilateral  damage  in  inferior  visual  association  cor¬ 
tices  within  the  occipital  region  (especially  the  medial  sector  of  areas  18 
and  19  in  figure  3.1)  can  no  longer  perceive  colors  normally,  nor  can  they 
evoke  an  internal  representation  of  those  colors,  such  as  picturing  the  color 
of  blood  (Damasio  et  al.  1980).  It  must  be  noted  that  damage  to  anterior 
visual  and  high-order  association  cortices  does  not  preclude  color  knowl¬ 
edge  or  color  perception.  That  color  imagery  is  lost  with  "early"  (posterior) 
cortical  damage  but  not  with  "late"  (anterior)  integrative  cortical  lesions 
suggests that color is not re-represented in high-order cortices. When the
internal  reconstruction  of  a  complex  representation  requires  the  evocation 
of  color,  the  color  "content"  is  generated  from  "early"  visual  cortices. 

Fifth, damage to human inferotemporal cortices, the areas 21, 20, 36,
and  37  that  collectively  form  IT,  selectively  impairs  retrieval  of  knowledge 
about  certain  categories  of  entities.  [Before  we  go  any  further  an  anatomi¬ 
cal  clarification  is  in  order.  Please  refer  to  figure  3.2,  comparing  the  human 
and  monkey  temporal  lobes.  The  difference  is  immense.  It  is  important 
to  note  that  the  region  that  corresponds,  in  the  human,  to  the  monkey's 
IT  is  probably  far  more  posterior  and  inferior  than  in  the  monkey.  The 
results  described  here  relative  to  the  human  IT  must  be  considered  in  the 
context  of  this  major  anatomical  difference.  The  region  from  which  Tanaka 
and  his  group  (Fujita  et  al.  1992)  have  recently  recorded,  which  they  call 
"anterior  inferotemporal,"  is  in  the  middle  inferotemporal  region  of  the 
monkey,  and  would  be  in  the  posteroventral  temporooccipital  region  in 
humans.]  Patients  with  such  damage  fail  to  conjure  up  knowledge  per¬ 
taining  to  some  entities  but  easily  conjure  up  knowledge  pertaining  to 
others,  along  consistent  patterns  of  dissociation  (Warrington  and  Shallice 
1984;  McCarthy  and  Warrington  1988;  Damasio  1990).  The  IT  region  seems 
necessary  to  retrieve  knowledge  about  entities  that  are  learned  through 
the  visual  modality  alone  and  that  share  physical  structures  with  several 
other  different  entities;  typical  examples  are  dog-like  animals  such  as  fox, 
raccoon,  coyote,  wolf,  and  German  shepherd  dog,  whose  physical  traits 
strongly  resemble  each  other.  We  have  designated  such  entities  as  visu¬ 
ally  "ambiguous."  Retrieval  of  knowledge  about  visual  entities  with  lesser 
ambiguity  does  not  depend  on  this  region,  and  that  is  why  animals  whose 
shapes  are  "outliers"  (the  elephant  is  the  typical  example)  pose  no  prob¬ 
lem  for  recognition.  Knowledge  of  entities  that  were  learned  through  both 
the visual and somatosensory modalities (for instance, most manipulable
tools  and  utensils)  does  not  depend  on  this  system  either.  The  evidence 
uncovers  a  systematic  correspondence  between  the  presence  of  damage  in 
certain  systems  and  the  impaired  retrieval  of  certain  types  of  knowledge. 
We  believe  the  correspondence  arises  because  of  constraints  dictated  by 
the  physical  characteristics  of  the  entities  and  by  neuroanatomical  design 
(Damasio  1989c;  Damasio  et  al.  1990a). 

In  brief,  (1)  access  to  previously  acquired  nonunique  knowledge  can  be 
disrupted  by  lesions  in  specific  neural  subsystems,  and  (2)  the  defect  is 
not  equal  for  different  categories  of  concrete  knowledge.  Access  to  differ¬ 
ent  types  of  concrete  knowledge  thus  depends  on  different  subsystems. 
As  we  will  note  below,  we  do  not  believe  that  the  damaged  systems  hold 
records  of  entity  representations  per  se.  We  hypothesize  that  they  direct 
the  simultaneous  activation  of  anatomically  separate  regions  whose  con¬ 
junction  defines  an  entity. 

Sixth,  damage  to  left  anterior  temporal  cortices  in  the  temporal  pole  (area 
38)  and  the  anterior  part  of  IT  (areas  20  and  21  in  figure  3.1)  causes  a  severe 
defect  for  naming  of  concrete  entities.  Patients  cannot  access  the  word 
forms that belong to unique entities (proper nouns), nor can they retrieve the
word  forms  that  go  with  varied  nonunique  entities.  Patients  can,  however, 
generate  accurate  descriptions  about  all  the  entities  they  cannot  name. 
Only the access to the lexical entries that denote the overall concept is
defective. The nonverbal conceptual knowledge of the entities is intact.
The patients have no grammatical or phonemic/phonetic defects. We have
also discovered that these patients are able to generate the word form that
accurately denotes an action or relationship, without the slightest difficulty.
They can retrieve word forms (verbs) describing the action of entities whose
names they cannot retrieve, or that they may not even know. For instance,
shown the picture of a mother duck in a lake, being followed by several
ducklings, Boswell said: "The little things [ducks or ducklings, not named]
are following her, the mother [duck, not named]. They are all swimming."

Figure 3.2 A comparison of human and monkey cortices. Human cortices are on the left
and monkey cortices on the right. Note that they are not drawn to scale (the human brain
is about 10 times larger than the macaque's). The different position of the inferotemporal
cortex, and the different relation that inferotemporal cortex holds to the early visual cortices
are obvious.

Seventh,  damage  to  the  cortices  in  the  left  temporal  pole  (sparing  the 
anterior part of IT) causes a defect in the retrieval of word forms for proper nouns (e.g., the
names of persons or places). Access to word forms for common nouns is
intact.  Access  to  conceptual  knowledge  of  the  entities  denoted  by  both 
proper  and  common  nouns  is  also  intact  (Damasio  et  al.  1990c;  Graff- 
Radford  et  al.  1990;  Semenza  and  Zettin  1989).  Damage  to  the  same 
regions  in  the  right  hemisphere  seems  not  to  compromise  lexical  access. 

These  findings  suggest  that  the  conjoining  of  nonverbal  and  verbal  acti¬ 
vated  representations  pertaining  to  concrete  entities  depends  on  a  mediator 
mechanism  in  left  anterior  temporal  cortices.  The  mechanism  promotes  the 
reconstruction of a word form given the concept, or, conversely, the reconstruction
of the concept of an object given the word form. The systems
that  support  this  mediational  mechanism  do  not  contain  records  for  either 
words  or  concepts  themselves,  but  rather  records  of  the  probable  combina¬ 
tion  between  them,  that  is,  the  combinations  between  (1)  the  many  records 
that  subsume  the  concept  of  a  concrete  entity,  nonverbally,  and  (2)  the  many 
records  that  subsume  acoustical,  somatosensory,  and  motor  patterns  with 
which  a  given  word  form  can  be  reconstructed.  (It  well  may  be  the  case  that 
other  aspects  of  lexical  characterization,  such  as  the  syntactical  properties 
of  a  given  word,  will  also  be  activated  from  these  systems.)  In  brief,  these 
systems  promote  activity  in  other  cortices  (and  probably  basal  ganglia),  and 
it is through the resulting activity elsewhere that lexical access or concept
access is achieved.

Finally,  damage  in  left  lateral  frontal  cortices  compromises  retrieval  of 
some  word  forms  that  denote  some  classes  of  verbs,  while  leaving  ab¬ 
solutely  intact  the  retrieval  of  word  forms  for  nouns.  This  is  interesting 
evidence  for  the  fact  that  frontal  cortices  (and  the  parietal  and  mesial  cor¬ 
tices  that  feed  into  them)  are  concerned  with  other  aspects  of  conceptual 
representation  (e.g.,  space-time  trajectories  of  entities  rather  than  entity 
structure  itself),  and  neural  processing  (e.g.,  attention,  governance  of  re¬ 
sponse  selection,  and  motor  planning).  It  is  intriguing  to  find  retrieval 
of  some  word  forms  for  verbs  connected  with  this  system  given  the  fact 
that  some  verbs  do  describe  actions  in  space-time  rather  than  structural 
characteristics  of  entities  (Damasio  and  Tranel  1993). 

REGIONALIZATION  OF  KNOWLEDGE  ACCESS  AND 
CORTICOCORTICAL  CONNECTIONS 

The  evidence  summarized  here  suggests  that  the  access  to  different  levels 
and  types  of  knowledge  depends  on  different  neural  systems  and  is  thus 
regionalized.  At  first  pass,  the  correspondence  between  retrieval  impair¬ 
ment  patterns  and  site  of  damage  further  suggests  that 

1.  Damage  to  early  visual  cortices  compromises  the  retrieval  of  features 
(e.g.,  color); 

2.  Damage  to  intermediately  placed  cortices  leaves  the  retrieval  of  fea¬ 
tures  intact,  but  may  compromise  the  retrieval  of  knowledge  pertaining  to 
certain  categories  of  concrete  knowledge,  that  is,  compromise  retrieval  of 
knowledge  for  some  nonunique  entities  while  sparing  others; 

3.  Damage  to  the  anterior-most  cortices  compromises  retrieval  of  knowl¬ 
edge  regarding  virtually  any  unique  entity  or  event  (scene)  but  leaves  intact 
retrieval  of  features,  entity  components,  and  nonunique  entities. 
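
The three correspondences listed above can be restated as a small lookup table. This is only an illustrative sketch: the site and rank labels are hypothetical shorthand introduced here, and a value of None marks a rank on which the list itself is silent.

```python
from typing import Dict, List, Optional

# Status of retrieval after damage at each site, as summarized in the list above.
# None marks a rank the list does not comment on; the site and rank labels
# are hypothetical shorthand, not terminology from the text.
RETRIEVAL_AFTER_DAMAGE: Dict[str, Dict[str, Optional[str]]] = {
    "early_visual": {
        "features": "compromised",
        "nonunique_entities": None,
        "unique_entities": None,
    },
    "intermediate": {
        "features": "intact",
        "nonunique_entities": "compromised for some categories",
        "unique_entities": None,
    },
    "anterior_most": {
        "features": "intact",
        "nonunique_entities": "intact",
        "unique_entities": "compromised",
    },
}


def compromised_ranks(site: str) -> List[str]:
    """Return the knowledge ranks whose retrieval is affected by damage at `site`."""
    profile = RETRIEVAL_AFTER_DAMAGE[site]
    return [
        rank
        for rank, status in profile.items()
        if status is not None and status.startswith("compromised")
    ]
```

Read this way, the rank of knowledge that is compromised rises as the site of damage moves from posterior to anterior cortices.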

Features,  nonunique  entities,  and  unique  entities  are  constituted  by 
knowledge  of  remarkably  different  ranks  in  terms  of  amount  of  compo¬ 
nents,  relational  complexity  of  those  components  within  the  entity,  and 
relational  complexity  of  associations  between  the  entity  itself  and  other 
entities and events. For instance, to classify an entity as unique, we must
know  about  intrinsic  and  relational  details  that  are  far  more  complex  than 
those  of  a  nonunique  entity.  In  turn,  nonunique  entities  require  more  com¬ 
plexity  than  features. 

These  findings  suggest  a  tentative  principle:  access  to  concrete  knowl¬ 
edge  of  higher  hierarchical  status  requires  structures  in  anteriorly  placed 
temporal  cortices,  whereas  access  to  concrete  knowledge  of  lower  complex¬ 
ity  only  requires  posterior  occipital  cortices.  (It  should  be  noted  that  when 
we  refer  to  anterior,  or  intermediate,  or  posterior,  we  do  not  necessarily 
include  anterior,  intermediate,  or  posterior  structure  of  both  hemispheres. 
On  the  contrary,  because  of  cerebral  hemisphere  dominance  effects,  certain 
types  and  levels  of  knowledge  may  require,  say,  an  anterior,  or  intermedi¬ 
ate  cortex  of  one  hemisphere  only.) 

The  cortical  regions  whose  damage  we  discussed  above  are  anatomically 
distinguishable  in  a  variety  of  ways  (e.g.,  cytoarchitecture,  subcortical  con¬ 
nections,  possibly  intrinsic  circuitry),  but  we  have  chosen  to  focus  on  their 
distinction  in  terms  of  long-range  corticocortical  projections.  The  impaired 
retrieval  of  more  complex  knowledge  correlates  with  damage  to  the  cortices 
located  closest  to  the  apices  of  feedforward  chains  culminating  in  entorhi- 
nal  cortex,  and  farthest  from  the  beginning  of  the  feedforward  chains  in 
primary  sensory  cortices.  Reciprocating  feedback  projections  from  those 
cortices  recapitulate  the  feedforward  projections  in  reverse  direction  (Van 
Hoesen 1982; Felleman and Van Essen 1991). That the impaired retrieval of less
complex knowledge correlates with damage in cortices located more posteriorly
suggests that there is a principled relationship between "rankings"
of knowledge access and "rankings" of corticocortical connections.

As we will discuss later, "access to," leading to "retrieval of," must be distinguished
from "represented at." The knowledge that can be accessed from
anterior  temporal  cortex  is  not  fully  represented  in  anterior  temporal  cortex 
in  the  sense  that  no  "image"  is  likely  to  be  there.  Incidentally,  this  is  the 
position  we  take  in  interpreting  the  meaning  of  "face"  cells  or  "hand"  cells, 
or,  for  that  matter,  the  cells  recently  described  by  Tanaka's  group  (Fujita  et 
al.  1992).  Rather,  we  see  these  cells  as  part  of  the  network  whose  activity 
may  reenact  an  explicit  representation.  When  they  are  activated,  these  cells 
in  high-order  cortices  contribute  to  the  reenactment  of  the  explicit  repre¬ 
sentation,  i.e.,  they  are  critical  to  the  neural  process  on  the  basis  of  which 
we  experience  the  representation,  but  they  are  neither  the  "sole  basis  for" 
nor  the  "site  of"  that  neural  process.  There  is  no  single  basis  or  site  for 
such  a  process.  Finally,  we  should  note  that  what  we  call  knowledge  cor¬ 
responds  to  records  of  interactions  between  (1)  entities  and  events  external 
to  the  individual  (or  internal  to  the  individual  but  external  to  the  brain), 
and  (2)  the  brain  (such  as  it  is  anatomically  at  the  time  of  the  interactions). 
Although  the  external  entities  and  events  are  real,  they  are  not  necessarily 
as  we  construct  them  with  our  neural  apparatus. 

The  relationship  suggested  by  the  evidence,  then,  is  between  ranking 
of  knowledge  (as  qualified  above)  and  ranking  of  corticocortical  connections. 
The emphasis on this principle does not mean that we are ignoring the
existence  and  contributory  role  of  many  other  connections  also  available 
at all stations of these hierarchies, that include (a) heterarchical connections
between and among parallel corticocortical projection streams; (b)
local (intrinsic) cortical connections; (c) subcortical connections, direct or
re-entrant; and (d) commissural connections. It simply means that we
suspect that ranking of corticocortical connections is the most distinctive
aspect  of  these  regions  inasmuch  as  the  substrates  of  knowledge  are  con¬ 
cerned.  Those  other  distinctive  anatomical  aspects  of  different  cortical 
stations  are  critical  for  the  continued  development  and  adjustment  of  the 
system.  Elsewhere  (Damasio  1989a,b)  and  below,  we  have  argued  that  as 
the  organism  interacts  with  the  environment,  the  selection  of  circuitries 
that  corresponds  to  varied  interactions  with  the  environment  is  carried 
out  with  the  help  of  those  other  systems.  For  instance,  subcortical  connec¬ 
tions  hailing  from  nuclei  concerned  with  biological  drives  are  critical  to 
this  process. 

KNOWLEDGE  REPRESENTATION 

Dispositional  (Nonmapped)  Representations  versus  Explicit  (Mapped) 
Representations 

What did the areas damaged in the patients described above contain or
contribute, while they were healthy, to the mental representations on which
knowledge  is  based?  What  is  the  relation  between  the  knowledge  that  fails 
to  be  retrieved  and  the  station  of  corticocortical  connections  destroyed  by  a 
lesion?  We  will  entertain  two  alternatives.  In  the  first  alternative,  the  con¬ 
ceptual  knowledge  relative  to  a  given  entity  is  contained  in  the  circuitry 
of  the  damaged  region.  (This  is  not  only  the  classic  alternative  but  the  one 
that  remains  implicit  behind  most  current  work  in  neuropsychology  and 
neurology.)  Both  the  reactivation  of  sensory  and  motor  properties  defin¬ 
ing  a  concept,  as  well  as  the  "know-how"  needed  to  reconstitute  those 
properties,  would  depend  on  that  region  alone.  Their  reinstantiation  in 
consciousness  would  result  from  attended  neural  activity  within  that  one 
region.  In  that  traditional  alternative,  the  failure  in  knowledge  retrieval 
comes  from  the  absence  of  the  region  and  of  the  high-level  representations 
previously  contained  in  it.  We  must  reject  this  alternative.  We  believe 
that the results discussed above are incompatible with this view, because,
if it were correct, the nature of the losses following lesions at different levels
would be different from what is observed. Rather than respecting the levels of complexity and the relative
functional  kinship  of  entities,  the  lesions  would  lead  to  "unprincipled" 
losses  in  which  all  knowledge  about  entire  categories  of  entities  would 
simply  vanish. 

In the alternative we favor, the sensorimotor properties on which a concept
is based would be retrieved from early sensory and motor cortices, and
the effects of focal attention would be placed in those parts of the distributed
network. The retrieval would have been directed by the area now
damaged,  had  it  been  intact.  The  circuitry  of  the  damaged  region  previ¬ 
ously  contained  knowledge  about  which  separate  brain  regions  should  be 
reactivated to conjure up varied properties, and about how those regions
should  be  related  temporally  and  spatially. 

The  failure  in  knowledge  retrieval  comes  from  the  damaged  region  being 
unable  to  conduct  the  reconstruction,  but  the  reconstruction  itself  would 
have  depended  on  circuitry  in  many  other  regions.  In  short,  a  lesion  in 
a  given  place  modifies  the  "working  habits"  of  nonlesioned  regions  that 
are  connectionally  related  to  the  lesioned  region.  Following  a  lesion,  the 
remainder  of  the  brain  works  differently  rather  than  as  before.  This  may 
sound  trivial  but  it  is  not.  The  results  of  lesion  experiments  are  often  inter¬ 
preted  as  if  the  healthy  systems  connected  to  the  damaged  area  continue 
doing  precisely  what  they  did  before  the  lesion. 
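
The distinction between a region that holds content and a region that directs reconstruction can be made concrete with a toy sketch. Everything named here is a hypothetical illustration: the region labels, the fragment keys, the example entity, and the onset times are assumptions chosen for clarity, not data. The higher-order table stores only which fragments to reactivate and when; the fragments themselves stay in the early cortices.

```python
from typing import Dict, List, Optional, Tuple

# Early cortices hold the explicit fragments (hypothetical example content).
EARLY_CORTICES: Dict[str, Dict[str, str]] = {
    "visual": {"f1": "shape of a specific face"},
    "auditory": {"f2": "sound of a specific voice"},
}

# A higher-order region holds only dispositions: which fragment to reactivate
# in which region, and at what relative onset (ms). No content is stored here.
Disposition = List[Tuple[str, str, int]]  # (region, fragment key, onset in ms)
HIGHER_ORDER: Dict[str, Disposition] = {
    "person_X": [("visual", "f1", 0), ("auditory", "f2", 20)],
}


def reconstruct(
    entity: str, higher_order: Dict[str, Disposition]
) -> Optional[List[Tuple[int, str]]]:
    """Reassemble an entity from early-cortex fragments, ordered by onset.

    Returns None when the directing record is unavailable (a "lesion"),
    even though every fragment is still present in EARLY_CORTICES.
    """
    plan = higher_order.get(entity)
    if plan is None:
        return None
    return sorted((onset, EARLY_CORTICES[region][key]) for region, key, onset in plan)


intact = reconstruct("person_X", HIGHER_ORDER)  # fragments returned in temporal order
lesioned = reconstruct("person_X", {})          # None: nothing directs the retrieval
```

In the lesioned case the fragment records are untouched, yet the entity cannot be reassembled. That is the sense in which the failure comes from the damaged region being unable to conduct the reconstruction.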

The alternative we favor implies a relative functional compartmentalization
for the normal brain. One large set of systems in early sensory
cortices  and  motor  cortices  would  be  the  base  for  "sense"  and  "action" 
knowledge,  i.e.,  the  highly  multiregional  substrate  for  "explicit"  represen¬ 
tations  which  are  the  key  to  our  experience  of  knowledge.  Another  set  of 
systems  in  higher-order  cortices  would  orchestrate  time-locked  activities 
in  the  former,  that  is,  would  promote  and  establish  temporal  correspon¬ 
dences  among  separate  areas.  Yet  another  set  of  systems  would  ensure  the 
attentional  enhancement  required  for  the  concerted  operation  of  the  oth¬ 
ers.  These  sets  of  systems  would  operate  under  two  major  influences:  (1) 
internal  biases,  expressed  in  brain  core  networks  concerned  with  enacting 
biological  drives  and  instincts  required  for  survival,  and  (2)  the  structures 
and  actions  in  the  external  environment. 

In the overall system we envision, there is neither Cartesian dualism between
matter and mind, nor homuncular dualism between "images" and
a  "perceiver"  of  those  images.  There  is  also  no  infinite  regress.  On  the 
contrary,  neurally  speaking,  there  is  a  finite  number  of  convergence  steps 
downstream  from  the  system  in  which  the  neural  activity  corresponding 
to  explicit  images  takes  place.  But  the  top  steps  in  the  hierarchy  are  neither 
a  homunculus  nor  a  perceiver.  They  are  merely  the  most  distant  conver¬ 
gence  points  from  which  divergent  retroactivation  can  be  triggered.  What 
is  infinite,  instead,  is  the  ceaseless  production  of  new  activity  states,  in 
early  sensory  cortices  and  in  motor  cortices,  across  time.  It  is  those  neu¬ 
ral  states,  one  after  the  other,  that  can  be  said  to  constitute  "regresses" 
for  the  previous  state.  But  it  should  be  noted  that  they  occur  in  the  same 
set  of  systems,  at  different  times,  unlike  the  classic  and  much  maligned 
homuncular  regress  that  occurs  in  different  neural  sites,  ever  more  re¬ 
moved  from  the  "perceptual"  site,  in  space  and  in  time.  It  is  the  perpetu¬ 
ally  recursive  property  of  corticocortical  systems  that  permits  this  special 
form  of  regress.  We  have  previously  discussed  evidence  in  support  of  this 
view  (Damasio  1989a, b);  additional  arguments  are  discussed  in  Damasio 
(1994). 

Reconstructing Mapped Representations

The reconstruction of pertinent property "representations" is thus accomplished
in many separate cortical regions, by means of long-range corticocortical
feedback projections that mediate relatively synchronous excitatory
activation. In the large scale reconstruction that we envision from
higher-order  cortices  (such  as  those  in  anterior  temporal  lobe),  the  time 
scale  of  the  synchronization  would  be  in  the  order  of  several  hundred 
msec,  and  even  beyond  1000  msec,  the  scale  required  for  meaningful,  con¬ 
scious  cognition.  But  at  more  local  levels,  for  instance,  in  posterior  tem¬ 
poral  cortices,  the  scale  would  be  smaller,  in  the  order  of  tens  of  millisec¬ 
onds.  The  return  projections  necessary  for  the  reconstruction  are  aimed 
toward  layers  I  and  V  of  the  cortex,  mainly  the  former,  in  which  feedfor¬ 
ward  projections  originated  (see  Rockland  and  Virga  1989).  We  have  called 
the  neural  device  from  which  reconstructions  are  conducted  a  convergence 
zone. 

Convergence  Zones 

In  essence,  a  convergence  zone  is  an  ensemble  of  neurons  within  which 
many feedforward/feedback loops make contact. Its connectional structure
is as follows: a convergence zone receives feedforward projections
from cortical regions located in the connectional level immediately below
and sends reciprocal feedback projections to the originating cortices; sends
feedforward projections to cortical regions in the next connectional level
and receives return projections from it; is influenced by a broad class of
cortices concerned with attentional control and response selection, such
as prefrontal and cingulate cortices, directly or indirectly, which are in turn
reentrantly connected to basal ganglia; receives projections from heterarchically
placed cortices; and receives projections from subcortical nuclei in
thalamus, basal forebrain, brain stem, etc. This rich network of extrinsic
connections  is  complemented  by  a  complex  network  of  intrinsic  intralam¬ 
inar  and  interlaminar  connections. 

A  convergence  zone  is  located  within  a  convergence  region.  We  envision 
that  there  are  in  the  order  of  thousands  of  convergence  zones,  which  are 
all  microscopic  neuron  ensembles,  located  within  the  macroscopic  conver¬ 
gence  regions  that  have  been  cytoarchitectonically  defined  and  that  num¬ 
ber  about  one  hundred.  Both  convergence  regions  and  convergence  zones 
come  into  existence  under  genetic  control.  But  epigenetic  control,  as  the 
organism  interacts  with  the  environment,  may  alter  convergence  regions, 
and  massively  alter  convergence  zones  through  synaptic  strengthening.  As 
noted  above,  synaptic  strengthening  occurs  under  particular  conditions, 
in  which  circumstances  external  to  the  brain  match  the  survival  needs  of 
the  organism,  its  intentions  so  to  speak,  as  expressed  in  biological  drive 
networks.  It  is  reasonable,  then,  to  talk  about  synaptic  strengthening  as 
a  selective  process,  in  the  sense  in  which  Changeux  (1976)  and  Edelman 
(1987) have used the concept, although this does not commit us to a particular
unit of selection such as a neuron or a neuron group.

A  convergence  zone  is  thus  a  means  of  establishing,  through  synaptic 
strengthening, preferred feedforward/feedback loops that use subsets of
neurons  within  the  ensemble.  A  subset  of  the  neurons  in  the  convergence 
zone  would  "learn"  to  activate  a  large  number  of  spatially  distributed  neu¬ 
ral  ensembles,  in  temporal  proximity,  by  means  of  feedback  projections. 
The  convergence  zone  could  be  excited  by  any  (or  a  subset)  of  the  feedfor¬ 
ward  projections  that  were  originally  paired  with  the  feedbacks  coming 
out  of  the  convergence  zone,  or  by  feedback  projections  of  convergence 
zones  from  a  higher  station,  or  from  heterarchical  connections. 
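
As a purely illustrative sketch, and not part of the framework as stated, the retroactivation idea in this paragraph can be caricatured in a few lines of Python. The class name `ConvergenceZone`, the region labels, the Hebbian-style increment, and the firing threshold are all hypothetical choices made for the example. The zone strengthens its links to regions that are active together and later, given a partial cue, diverges back through feedback to every region it was linked to.

```python
class ConvergenceZone:
    """Toy convergence zone: learns which feature regions were co-active,
    then reactivates the whole set from a partial cue via 'feedback'."""

    def __init__(self, regions, threshold=0.3):
        self.regions = list(regions)
        # Assumed fraction of the learned input that must be cued for the zone to fire.
        self.threshold = threshold
        # One combined feedforward/feedback strength per region (a simplification).
        self.weights = {r: 0.0 for r in self.regions}

    def learn(self, active_regions, increment=1.0):
        # Strengthen links to regions that are active together (Hebbian-style assumption).
        for r in active_regions:
            self.weights[r] += increment

    def retrieve(self, cue_regions):
        # Feedforward drive: summed weight of the cued regions, normalized by total weight.
        total = sum(self.weights.values())
        if total == 0:
            return set()
        drive = sum(self.weights[r] for r in cue_regions) / total
        if drive < self.threshold:
            return set()
        # Feedback: diverge back to every region that was linked during learning.
        return {r for r, w in self.weights.items() if w > 0}


zone = ConvergenceZone(["shape", "color", "sound", "smell"])
zone.learn(["shape", "color", "sound"])
reactivated = zone.retrieve(["shape"])    # a single learned region as the cue
unrelated = zone.retrieve(["smell"])      # a region never paired with the zone
```

With the assumed threshold, a cue from any one of the learned regions reactivates all three, whereas a cue from a region that was never paired with the zone reactivates nothing.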

A  convergence  zone  develops  under  the  influence  of  (1)  temporally  close 
activity  in  multiple  feedforward  and  feedback  lines  that  are  simultane¬ 
ously  active  when  a  number  of  anatomically  separate  regions  are  active 
and are providing the normal substrate for a given perceptual/thought
process,  and  (2)  modulatory  action  from  feedback  and  feedforward  pro¬ 
jections  from  ipsilateral  and  contralateral  cortices,  and  subcortical  nuclei, 
during  (1).  The  development  of  a  convergence  zone  also  depends  on  local 
interactions  among  neurons  (e.g.,  from  their  intrinsic  collateral  arboriza¬ 
tions).  A  convergence  zone  would  be  the  result  of  convergence  of  feed¬ 
forward  inputs,  but  its  feedback  projections  operate  by  diverging  toward 
the  origin  of  feedforward  projections.  Naturally,  when  we  refer  to  neurons 
in  a  convergence  zone  we  refer  to  the  synaptic  pools  made  up  of  contacts 
among  those  neurons. 

We  have  hypothesized  that  there  would  be  two  main  types  of  activity  in 
a  convergence  zone.  In  the  stable  type,  the  excitation  of  one  or  a  few  neu¬ 
rons  feeding  into  it  generates  maximal  temporally  close  activity  in  many 
feedbacks  that  participate  in  the  convergence  zone.  This  then  generates 
temporally  close  activations  in  several  regions  that  originally  projected 
forward  to  it.  What  we  envision  convergence  zones  achieving  is  the  recre¬ 
ation  of  separate  sets  of  neural  activity  that  were  grossly  simultaneous, 
that  is,  that  coincided  during  the  time  window  necessary  for  us  to  attend 
to  it  and  be  conscious  of  it,  which  means  hundreds  of  milliseconds.  How¬ 
ever,  this  may  not  necessarily  translate  into  simultaneous  activity  within 
the  convergence  zone.  In  fact,  it  is  much  more  likely  that  there  would  be 
an  extremely  fast  sequence  of  activations  that  would  make  separate  neural 
regions  come  on-line  in  some  order  imperceptible  to  consciousness. 

Convergence  zones  also  fire  forward,  through  their  feedforward  pro¬ 
jections,  into  other  convergence  zones  located  at  a  higher  level.  In  turn, 
feedback firing from those higher-level convergence zones would broaden
the  scope  of  regions  activated  in  response  to  the  initial  stimulus.  This 
first  type  of  activity  depends  on  strong  local  synaptic  linkages  among 
subsets  of  neurons  in  a  convergence  zone.  In  a  second  type  of  activ¬ 
ity,  less  stable  modes  of  firing  would  activate  subsets  of  feedbacks,  lead¬ 
ing  to  novel  combinations  of  activity.  This  would  depend  on  transient 
combinations. 

In short, knowledge retrieval would be based on relatively simultaneous,
attended  activity  in  many  early  cortical  regions,  engendered  over  several 
recursions  in  such  a  system.  Separate  activities  in  early  cortices  would  be 
the  basis  for  reconstructed  representations.  The  level  at  which  knowledge 
is  retrieved  (e.g.,  supraordinate,  basic  object,  subordinate)  would  depend 
on  the  scope  of  multiregional  activation.  In  turn,  this  would  depend  on  the 
level  of  convergence  zone  that  is  activated.  Low  level  convergence  zones 
bind  signals  relative  to  entity  categories  (e.g.,  the  color  and  shape  of  a  tool), 
and  are  placed  in  association  cortices  located  immediately  beyond  (down¬ 
stream  from)  the  cortices  whose  activity  defines  featural  representations. 
In  humans,  in  the  case  of  a  visual  entity,  this  would  include  cortices  in  ar¬ 
eas  37  and  39,  downstream  from  the  maps  in  V3,  V4,  and  V5.  Higher-level 
convergence  zones  bind  signals  relative  to  more  complex  combinations, 
for  instance,  the  definition  of  object  classes  by  binding  signals  relative  to 
their shape, color, sound, temperature, and smell. These convergence zones
are  placed  at  a  higher  level  in  the  corticocortical  hierarchy  (e.g.,  within 
more  anterior  sectors  of  37  and  39,  22,  and  20).  The  convergence  zones 
capable  of  binding  entities  into  events  and  describing  their  categorization 
are located at the top of the hierarchical streams, in anteriormost temporal
and  frontal  regions. 

The  "firing"  knowledge  embodied  in  the  convergence  zone  is  the  result 
of  previous  learning,  during  which  feedforward  projections  and  reciprocat¬ 
ing  feedback  projections  were  simultaneously  active.  Both  during  learning 
and retrieval, the neurons in a convergence zone are under the control of
a  variety  of  cortical  and  noncortical  projections.  This  includes:  projections 
from  thalamus,  the  nonspecific  neurotransmitter  nuclei,  and  other  cortical 
projections  from  convergence  zones  in  prefrontal  cortices,  cortices  located 
higher  up  in  the  feedforward  hierarchy,  homologous  cortices  of  the  oppo¬ 
site  hemisphere,  and  heterarchical  cortices  of  parallel  hierarchical  streams. 
The  essence  of  this  framework,  then,  comprises  reconstruction  of  entities 
and  scenes  from  component  parts  and  integration  of  component  parts  by 
time  correlations.  The  requisite  reactivation  is  mediated  by  excitatory  pro¬ 
jections. 


CONCLUDING  REMARKS 


In  closing,  we  would  like  to  add  a  few  words  concerning  the  evidence  now 
available  for  this  framework,  as  well  as  the  relation  it  may  have  to  other  the¬ 
oretical  approaches.  In  addition  to  the  evidence  adduced  above,  from  hu¬ 
man  neuropsychology  and  from  experimental  neuroanatomy,  there  is  also 
supporting  evidence  in  the  recent  neurophysiological  findings  of  Singer 
and colleagues (1993), Eckhorn and colleagues (1988), and Fetz et al. (1991),
all  of  which  indicate  the  presence  of  temporal  correlations  among  anatom¬ 
ically  separate  cortical  regions  relative  to  a  single  stimulus.  These  results 
have been obtained in perceptual or motor experiments, but there is no reason
why they cannot be extrapolated to the knowledge retrieval processes
we are discussing here. The essence of these results is that geographically
separate  regions  of  brain  can  be  active  at  the  same  time  when  their  activity 
is  related  to  the  same  thing.  There  is  also  evidence  for  the  type  of  function¬ 
ally  segregated  neuron  ensembles  we  call  convergence  zones  in  the  recent 
work  of  Fujita  et  al.  (1992),  and  we  interpret  the  classical  "face"  and  "hand" 
cells  as  being  part  of  convergence  zones.  Concerning  other  approaches, 
the  massive  recurrence  of  neuroanatomical  pathways  that  we  propose  for 
knowledge  retrieval  is  a  component  of  Edelman's  model  of  visual  per¬ 
ception  (Edelman  1987),  in  which  feedback  activity  is  subsumed  by  the 
concept  of  "reentry."  There  are,  however,  several  distinctions.  Edelman's 
model  does  not  use  a  convergence-divergence  architecture.  The  maps  are 
fully  and  reciprocally  interconnected,  in  both  a  hierarchical  and  heterarchi¬ 
cal  manner.  This  characteristic  seems  well  suited  to  the  constructive  roles 
that  the  very  early  visual  cortical  regions  are  likely  to  play,  and  for  which 
a  convergence  zone  architecture  might  not  be  sufficient.  It  is  supported 
by  neuroanatomical  findings  (Zeki  and  Shipp  1988;  Rockland  and  Pandya 
1979).  Another  distinction  concerns  the  fact  that  reentry's  principal  means 
of  operation  is  the  "synthesis  of  signals"  within  the  neuronal  populations 
that  get  reentered.  The  framework  we  have  proposed  does  not  include 
such  a  means  of  operation  and  relies  instead  on  a  correlative  operation 
which  appears  similar  to  the  one  Singer  and  von  der  Malsburg  envision 
(although in recent work from the same group, reentry also accomplishes
a correlative function; see Tononi et al. 1992; Edelman 1992).


The Interaction of Neural Systems for
Attention and Memory


Robert  Desimone,  Earl  K.  Miller,  and  Leonardo 
Chelazzi 


INTRODUCTION 


The  visual  recognition  of  objects  depends  on  a  cortical  processing  pathway 
that begins in area V1 and continues through areas V2, V3, V4, and the
inferior  temporal  (IT)  cortex  (Ungerleider  and  Mishkin  1982;  Desimone  et 
al.  1985;  Maunsell  and  Newsome  1987;  Desimone  and  Ungerleider  1989; 
Felleman  and  Van  Essen  1991).  As  one  proceeds  along  this  pathway,  the 
receptive  fields  of  neurons  increase  steadily  in  size  and  the  analysis  of 
object  features  becomes  increasingly  complex.  Although  these  large  fields 
will  typically  contain  many  different  stimuli,  we  have  previously  shown 
that  spatial  attentional  mechanisms  limit  the  amount  of  information  that 
is  processed  within  them.  When  one  attends  to  a  stimulus  at  one  location 
within  the  receptive  field  of  a  neuron  in  V4  or  IT  cortex,  the  responses  to 
stimuli  at  other,  ignored,  locations  are  suppressed  (Moran  and  Desimone 
1985).  Thus,  spatially  directed  attention  controls  the  information  processed 
in  extrastriate  cortex  and  regulates  access  to  memory.  Generally  speaking, 
we  remember  what  we  attend  to.  The  spatial  attention  system  that  controls 
extrastriate  processing  appears  to  involve  a  number  of  different  cortical 
and  subcortical  structures,  several  of  which  are  closely  associated  with  the 
oculomotor  system  (Desimone  et  al.  1990;  Posner  and  Driver  1992;  Posner 
and  Petersen  1990;  Colby  1992;  see  chapter  9  by  Posner  and  Rothbart). 

Although  attention  may  regulate  access  to  memory,  the  reverse  also  oc¬ 
curs.  Consider  the  following  scenarios.  While  driving  to  work,  you  pay 
little  attention  to  all  of  the  surrounding  cars  traveling  in  their  lanes  but 
react  immediately  to  a  car  that  makes  an  unexpected  change  in  direction 
in front of you. At work, you walk into your office and are startled that
your familiar desk chair has been replaced by a new one. After studying
the  new  chair,  you  search  for  your  coffee  cup  buried  among  the  objects 
cluttering  your  desk.  When  it  is  found,  you  switch  your  attention  to  the 
wall  of  the  room,  where  you  know  you  will  find  your  clock. 


Each of the everyday behaviors described above illustrates how memory
guides attention. In the case of the car changing direction, the car violated
the expectation of its behavior built up over the course of the previous few
seconds or minutes, thereby eliciting attentional and orienting responses.
In  the  case  of  the  new  chair,  attention  was  attracted  by  the  mismatch  be¬ 
tween  the  new  chair's  image  and  the  representation  of  the  familiar  chair 
in  long-term  memory.  In  both  cases,  the  representation  of  stimuli  in  short- 
and  long-term  memory  contributed  as  much  to  their  salience  as  did  purely 
visual  properties  such  as  color  or  brightness.  In  the  case  of  the  coffee  cup, 
it  was  the  representation  of  the  cup  in  long-term  memory  that  guided  the 
attentional  search  of  the  cluttered  desk.  Even  in  the  case  of  the  clock  on 
the  wall,  which  might  be  considered  a  simple  case  of  spatially  directed 
attention,  it  was  the  memory  of  the  location  of  the  clock  in  the  room  that 
guided  the  locus  of  attention.  As  we  will  see  below,  memory  not  only 
influences  attention,  but  is  intertwined  with  the  on-going  processing  of 
visual  information  in  the  cortex. 

In  this  chapter  we  take  a  bottom-up  approach  to  developing  models  of 
higher  brain  function.  We  first  describe  some  new  results  on  the  proper¬ 
ties  of  cortical  neurons  in  monkeys  performing  mnemonic  and  attentional 
tasks.  With  these  physiological  findings  serving  as  both  constraints  and 
inspiration,  we  begin  to  sketch  out  how  the  neural  systems  underlying 
memory  and  attention  interact,  resulting  in  self-directed  behavior. 

SHORT-TERM  MEMORY 

Most  cognitive  scientists  and  neuroscientists  would  probably  accept  the 
notion  that  the  neural  mechanisms  of  memory  include  a  facility  for  the 
temporary  storage  of  information.  The  neuropsychological  evidence  for 
separate  mechanisms  of  long-  and  short-term  memory  is  dramatic,  as  am¬ 
nesic  patients  with  damage  to  the  medial  temporal  lobe  may  show  normal 
retention  of  information  for  a  few  seconds  or  minutes  but  may  be  com¬ 
pletely  unable  to  hold  memories  for  longer  periods  of  time  (see  Baddeley 
1986, 1990,  for  reviews  of  both  normal  human  memory  and  amnesia).  Al¬ 
though  one  can  make  numerous  additional  distinctions  among  different 
types  or  components  of  both  long-  and  short-term  memory,  we  will  use 
the  terms  in  a  generic  sense  and  describe  some  of  the  physiological  results 
first  before  speculating  on  what  aspects  of  memory  they  may  explain. 

Recent  physiological  results  suggest  that  at  least  one  type  of  short-term 
memory  may  be  an  intrinsic  property  of  visual  cortex.  In  our  work,  we 
record  from  neurons  in  anterior  IT  cortex  of  rhesus  monkeys  performing 
delayed  matching  to  sample  tasks  (Miller  et  al.  1991b,  1993).  In  the  stan¬ 
dard  form  of  this  task,  a  monkey  is  shown  a  "sample"  stimulus  followed, 
after  a  short  delay,  by  a  "test"  stimulus,  and  it  must  indicate  whether  or 
not  the  test  stimulus  matches  the  sample.  Thus,  the  task  requires  a  type  of 
stimulus  memory,  or  recognition  memory,  lasting  for  the  length  of  a  behav¬ 
ioral  trial.  Several  studies  have  shown  that  IT  neurons  respond  differently 
to  the  test  stimulus  depending  on  whether  it  matches  the  sample  (Gross 
et  al.  1979;  Mikami  and  Kubota  1980;  Baylis  and  Rolls  1987;  Vogels  and 
Orban  1990;  Riches  et  al.  1991;  Eskandar  et  al.  1992),  and  other  studies 
have shown that some IT neurons have elevated activity during the delay
period, as if they are actively maintaining the memory of the sample
(Fuster  and  Jervey  1981;  Mikami  and  Kubota  1980;  Miyashita  and  Chang 
1988;  Miyashita  1988;  Sakai  and  Miyashita  1991;  Fuster  1990).  However, 
for  a  neural  memory  mechanism  to  be  useful,  it  must  have  the  capacity 
to  retain  information  over  long  intervals  that  are  not  blank  but  rather  are 
filled  with  new  stimuli  entering  the  visual  system,  competing  for  process¬ 
ing,  and  presumably  activating  the  same  cells  involved  in  the  storage  of 
memory  traces.  To  study  this  interplay  between  perceptual  and  mnemonic 
processing,  we  record  from  neurons  in  anterior  ventral  IT  cortex  in  a  task 
that  requires  the  monkeys  to  retain  items  in  memory  while  concurrently 
processing  new  stimuli  (Miller  et  al.  1991b,  1993). 

On  each  trial  of  the  task,  from  0  to  5  stimuli  intervene  between  the  sample 
and  the  final  matching  test  stimulus.  The  stimuli  are  complex  patterns, 
such  as  faces,  fruit,  or  textures,  which  the  monkey  has  seen  before.  These 
stimuli  elicit  stimulus-selective  responses  from  the  large  majority  of  cells, 
but  we  do  not  attempt  to  understand  which  features  of  the  stimuli  activate 
a  given  IT  neuron.  The  basis  of  object  coding  in  IT  cortex  is  currently  not 
understood — we  simply  assume  that  it  occurs  (see  chapter  8  by  Poggio  and 
Hurlbert). 

How  is  the  memory  of  the  sample  maintained?  The  surprising  result  is 
that,  for  nearly  half  the  cells  in  the  cortex,  the  memory  of  the  sample  is  re¬ 
flected  in  the  responses  to  the  test  stimuli.  That  is,  of  the  cells  that  respond 
to  a  given  sample  stimulus,  the  responses  of  about  half  of  them  are  a  joint 
function  of  the  current  stimulus  (i.e.,  the  magnitude  of  response  depends  in 
part  on  how  well  the  stimulus  fits  within  the  cell's  "feature  domain")  and 
the  stimulus  in  memory.  For  the  large  majority  of  these  cells,  responses  to 
the  test  items  are  suppressed  if  they  match  the  sample  in  memory  (see  also 
Baylis  and  Rolls  1987;  Riches  et  al.  1991;  Eskandar  et  al.  1992).  The  reduc¬ 
tion  in  response  is  typically  proportional  to  the  magnitude  of  the  response 
to  a  given  stimulus  (measured  when  it  is  presented  as  a  sample  or  non¬ 
matching  item).  For  example,  if  a  cell  prefers  red  stimuli,  these  stimuli  will 
show  the  greatest  reduction  in  response  when  they  match  the  sample  item. 
Furthermore,  the  suppression  is  maintained  even  if  up  to  five  nonmatching 
stimuli  intervene,  which  is  the  maximum  we  have  tested.  The  "memory- 
span"  of  this  suppressive  effect  seems  to  be  as  long  as  the  monkey  can  per¬ 
form  the  task.  A  few  cells  show  opposite  behavior  (enhanced  responses 
with  matching),  and  the  remaining  cells  (about  half  the  population)  seem 
to  convey  only  sensory  information  and  are  not  affected  by  memory. 

Not  only  are  responses  to  matching  items  affected  by  the  specific  item 
in  memory,  but  so  are  responses  to  nonmatching  items,  an  effect  that  has 
been  observed  by  Eskandar  et  al.  (1992).  That  is,  the  response  to  a  given 
nonmatching  item  depends  on  which  stimulus  had  been  seen  as  the  sample. 
There  is  suggestive  evidence  that  these  effects  are  related  to  the  similarity 
of  a  given  nonmatching  item  to  the  stimulus  in  memory  (Miller  et  al.  1993). 
The more similar the current stimulus is to the one in memory, the more the
response to the current stimulus appears to be suppressed. Thus, IT cells
seem to be computing similarity to memory traces, rather than matching
per se.

Figure 4.1 Responses of a population of IT neurons to test stimuli that matched the sample
stimulus in memory and, for comparison, the response to the same stimuli when they did not
match. The difference line plots the difference between the two histograms. The bin width is
10 msec. (Adapted from Miller et al. 1993)

On  the  basis  of  these  data,  we  have  proposed  that  a  population  of  IT 
neurons  functions  as  "adaptive  mnemonic  filters"  whose  responses  are  a 
joint  function  of  the  current  stimulus  and  stored  memory  traces.  A  given 
IT  neuron  gives  its  best  response  to  stimuli  that  contain  features  within  the 
cell's  feature  domain  (i.e.,  has  the  appropriate  color,  shape,  texture,  etc.) 
but  that  have  not  been  recently  seen.  Repetition  is,  in  a  sense,  a  type  of 
stimulus  feature. 
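
One way to make the "joint function" concrete is the toy Python sketch below. It is an illustration under stated assumptions, not the model proposed in the text: stimuli are represented as hypothetical sets of feature labels, similarity is measured with the Jaccard index, and the parameter `max_suppression` is invented for the example. The only properties taken from the findings above are that suppression scales with the sensory response and grows with similarity to the stored sample.

```python
def tuning(stimulus, preferred):
    """Sensory drive in [0, 1]: fraction of the cell's preferred features present."""
    shared = len(set(stimulus) & set(preferred))
    return shared / len(preferred)


def similarity(a, b):
    """Jaccard similarity between two feature sets (an assumed similarity measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filtered_response(stimulus, preferred, memory, max_suppression=0.5):
    """Sensory drive scaled down in proportion to similarity with the memory trace."""
    drive = tuning(stimulus, preferred)
    return drive * (1.0 - max_suppression * similarity(stimulus, memory))


preferred = {"red", "round"}                 # hypothetical feature domain of one cell
sample = {"red", "round", "smooth"}          # stimulus held in memory
match = filtered_response(sample, preferred, memory=sample)
near = filtered_response({"red", "round", "rough"}, preferred, memory=sample)
novel = filtered_response({"red", "round", "striped", "large"}, preferred, memory=sample)
```

All three test stimuli drive the cell equally well on sensory grounds, yet in this sketch the matching item gives the smallest response, the similar nonmatching item an intermediate one, and the least similar item the largest.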

A  critical  question  for  constructing  a  model  of  short-term  memory  is 
whether  the  mnemonic  influences  on  IT  responses  are  generated  within 
(or  before)  IT  cortex  or  result  from  feedback  from  other  structures.  An 
analysis  of  the  time  course  of  the  effects  in  IT  argues  against  certain  types 
of  feedback.  Figure  4.1  shows  the  time  course  of  the  population  response 
to  matching  and  nonmatching  items  in  IT  cortex.  The  suppression  of  re¬ 
sponse  to  matching  items  (compared  to  nonmatching)  begins  at  the  onset 
of  the  visual  response  (i.e.,  within  80  msec  of  stimulus  onset).  Thus,  it 
seems  highly  unlikely  that  the  suppression  is  due  to  feedback  from  mem¬ 
ory  structures  beyond  IT  cortex  that  do  the  actual  detection  of  matching 
items.  It  is  still  possible,  however,  that  critical  feedback  occurs  at  the  time 
of  storage  of  the  memory  trace  and  persists  through  the  time  of  retrieval. 

The  fact  that  the  suppression  begins  by  nearly  the  first  stimulus-evoked 
action potential in IT cortex also argues against the idea that the effect
depends  on  lengthy  temporal  processing  in  this  region.  We  have  tested 
whether  the  match-nonmatch  status  of  a  stimulus  is  coded  by  specific  tem¬ 
poral  variations  in  response  (i.e.,  responses  that  differ  in  their  time-course 
but  have  the  same  average  rate)  by  using  the  principal  components  of  the 
spike  trains  of  IT  cells  to  classify  stimuli  as  matching  or  nonmatching,  on  a 
trial-by-trial  basis.  In  contrast  to  recent  findings  of  Eskandar  et  al.  (1992), 
we  find  no  advantage  in  using  the  first  three  principal  components  of  the 
responses  to  classify  stimuli  as  matching  or  nonmatching,  compared  to 
using  just  the  average  firing  rate.  Since  Eskandar  et  al.  found  significant 
temporal  variations  in  IT  responses  in  monkeys  performing  a  matching 
to  sample  task  with  short  delays  and  without  intervening  items,  the  two 
results  together  suggest  that  successive  stimulus  presentations  do  cause 
significant temporal variations in neuronal responses, but that these do not span
long  delays  or  intervening  items  (although  other  explanations  are  possible, 
including  differences  in  recording  sites  in  IT  cortex).  These  results  do  not, 
of  course,  rule  out  "fast"  temporal  mechanisms  such  as  synchronized  fir¬ 
ing  among  IT  cells,  which  could,  in  principle,  cause  virtually  instantaneous 
changes  in  firing  rates  among  coupled  IT  cells  (see  chapter  5  by  Koch  and 
Crick  and  chapter  10  by  Singer). 
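
The distinction between a rate code and a temporal code drawn in this paragraph can be illustrated with a small Python example. The spike counts are invented, and the ramp-shaped profile is only a stand-in for a temporal component; an actual analysis would derive the principal components from the recorded spike trains. The point is simply that two responses can have the same average rate and still be separable by their time course.

```python
def mean_rate(binned_counts):
    """Average spike count per bin: a rate code ignores when the spikes occur."""
    return sum(binned_counts) / len(binned_counts)


def temporal_projection(binned_counts, profile):
    """Project a binned response onto a temporal profile (dot product)."""
    return sum(c * w for c, w in zip(binned_counts, profile))


# Hypothetical spike counts in four successive time bins.
early_heavy = [8, 6, 4, 2]
late_heavy = [2, 4, 6, 8]

# Zero-mean ramp, assumed here in place of a data-derived temporal component.
ramp = [-1.5, -0.5, 0.5, 1.5]

same_rate = mean_rate(early_heavy) == mean_rate(late_heavy)
early_score = temporal_projection(early_heavy, ramp)
late_score = temporal_projection(late_heavy, ramp)
```

Here the two responses are indistinguishable by mean rate but receive opposite scores on the temporal profile. The finding reported above is that, in this task, adding such temporal scores did not improve match-nonmatch classification beyond what the average rate already provided.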

We  have  also  examined  activity  during  the  delay  intervals  of  the  match¬ 
ing  task,  to  see  whether  cells  might  maintain  a  representation  of  the  sample 
in  memory  through  maintained  activity  in  the  retention  interval  (Fuster 
and  Jervey  1981;  Mikami  and  Kubota  1980;  Miyashita  and  Chang  1988; 
Miyashita  1988;  Fuster  1990;  Sakai  and  Miyashita  1991).  Although  a  quar¬ 
ter  of  the  cells  show  stimulus-specific  activity  in  the  delay  interval  immedi¬ 
ately  following  the  sample  stimulus,  this  activity  appears  to  be  "reset"  by 
the  first  intervening  item  (Miller  et  al.  1993).  That  is,  after  the  first  interven¬ 
ing  test  item,  the  amount  of  activity  in  the  subsequent  delay  interval  seems 
to  be  determined  more  by  the  intervening  item  than  by  the  sample  stimu¬ 
lus  in  memory.  Thus,  it  is  unlikely  that  maintained  IT  activity  in  the  delay 
mediates  the  memory  of  the  sample  in  this  particular  short-term  memory 
task.  Nonetheless,  as  we  will  see  later  in  the  chapter,  maintained  activity 
during  delay  periods  is  a  prominent  feature  of  neuronal  activity  in  some 
tasks.  Further,  we  have  evidence  that  the  delay  activity  can  be  switched 
on  and  off  under  the  monkey's  voluntary  control  (Chelazzi  and  Desimone, 
unpublished  data).  Maintained  activity  during  retention  intervals,  easily 
disrupted  by  intervening  stimuli,  may  nonetheless  serve  as  a  type  of  visual 
"rehearsal,"  that  helps  solidify  memory  traces  (for  example,  its  effect  on 
memory  may  be  equivalent  to  that  of  a  longer  stimulus  presentation  time), 
or  as  a  kind  of  visual  "sketchpad"  (Baddeley  1986)  for  comparing  stimuli 
presented  in  different  modalities  or  different  spatial  locations  (see  below). 

The  sensitivity  of  IT  neurons  to  repetition  suggests  an  analogy  to  figure- 
ground  separation  in  the  spatial  domain.  Many  cells  at  all  levels  of  the 
visual  system  respond  best  to  contrast  of  some  sort.  For  these  cells,  the 
greater  the  similarity  between  the  stimulus  in  the  receptive  field  and  those 
in the silent surround (stimulation of which does not elicit responses by itself),
the more the response to the receptive field stimulus is suppressed. As
one  advances  through  the  visual  system,  the  stimulus  features  that  are  con¬ 
trasted  may  become  more  sophisticated  and  the  spatial  areas  over  which 
the  interactions  occur  may  become  larger.  Based  on  these  properties,  it  has 
been  conjectured  that  one  of  the  functions  of  visual  cortex  is  to  separate 
figures  from  background  (Allman  et  al.  1985a;  Desimone  et  al.  1985). 

The  properties  of  IT  neurons  suggest  that  figure-ground  separation  oc¬ 
curs  in  the  temporal  domain  as  well.  As  a  result,  stimuli  that  have  not 
been  recently  seen  or  are  unexpected  may  tend  to  pop  out  from  an  array 
of  repeated  items.  In  a  sense,  the  past  functions  as  the  surround,  which  is 
compared  with  the  present  stimulus  in  the  receptive  field.  This  temporal 
figure-ground  extraction  may  occur  automatically,  as  repetition  effects  in 
IT  cortex  have  been  found  both  in  passively  fixating  and  in  anesthetized 
animals  (Miller  et  al.  1991a). 

If  the  analogy  with  spatial  receptive  fields  is  valid,  temporal  figure- 
ground  extraction  is  probably  not  unique  to  IT  cortex  but  may  be  an  in¬ 
trinsic  property  of  visual  cortex.  Figure-ground  separation  in  the  spatial 
domain  appears  to  build  incrementally  as  one  moves  through  the  visual 
system, and there may be a comparable build-up of temporal processing.
Nelson (1991), for example, finds that cells in striate cortex of the
cat  show  orientation-specific  suppression  lasting  a  few  hundred  millisec¬ 
onds.  Orientation-specific  temporal  interactions  are  also  found  in  area  V4 
(Haenny  et  al.  1988;  Maunsell  et  al.  1991).  Our  own  preliminary  evidence 
in  V4  in  monkeys  performing  the  same  task  and  viewing  the  same  stimuli 
that  we  used  in  IT  cortex  suggests  that  suppression  with  repetition  occurs 
in  V4  (Miller,  Li,  and  Desimone,  unpublished  data).  Presumably,  the  tem¬ 
poral  interval  over  which  stimuli  are  compared  is  much  smaller  (and  less 
able  to  span  intervening  stimuli)  in  earlier  visual  areas,  and  the  features 
that  are  compared  are  less  complex  than  in  IT  cortex.  The  results  of  the  ex¬ 
periments  described  below  indicate  that,  in  IT  cortex  at  least,  the  temporal 
interval  over  which  stimuli  are  compared  can  span  periods  comparable  to 
long-term  memory. 

The  notion  of  temporal  figure-ground  may  also  have  applications  to 
artificial  visual  systems.  Current  artificial  systems  typically  separate  vi¬ 
sual  processing  from  visual  memory,  whereas  the  biological  visual  system 
seems  to  integrate  both  at  relatively  early  stages,  building  on  them  incre¬ 
mentally.  This  integration  and  incremental  build-up  may  not  only  result 
in  processing  efficiencies  and  in  a  natural  interface  to  attentional  systems, 
but  may  also  simplify  visual  recognition,  which  would  not  have  to  be  ac¬ 
complished  in  one  final  stage. 

REPRESENTATION  OF  FAMILIARITY 

Physiological  studies  of  short-term  memory,  like  those  described  above, 
typically use stimuli that are highly familiar to the monkey, and mnemonic
effects are confined to a single trial of the monkey's task. However, by testing
the responses to novel stimuli that the monkey has never seen before,
one  can  observe  adaptive  memory  filtering  spanning  time  periods  consis¬ 
tent  with  long-term  memory  formation  for  stimuli  in  IT  cortex  (Miller  et 
al.  1991b;  Li  et  al.  1993).  As  with  short-term  memory,  we  use  the  term 
"long-term  memory"  here  in  a  generic  sense,  without  (initially)  claiming 
any  specific  type  or  component.  The  behavioral  task  we  use  to  study  long¬ 
term  memory  is  basically  the  same  as  in  the  short-term  memory  study, 
but  we  monitor  changes  in  the  response  to  the  initially  novel  sample  stim¬ 
uli  over  the  session,  as  the  animal  gradually  becomes  familiar  with  them. 
Thus,  we  study  the  memories  that  are  incidentally  acquired  during  task 
performance.  For  a  third  of  the  cells,  the  response  to  novel  stimuli  sys¬ 
tematically  declines  as  the  stimuli  become  familiar  to  the  animal  over  the 
course  of  an  hour-long  recording  session.  This  decline  is  stimulus-specific 
and  is  found  even  when  more  than  150  other  stimuli  (the  maximum  tested) 
intervene  between  repeated  presentations.  Virtually  all  of  these  cells  also 
show  selectivity  for  particular  visual  stimuli  and,  thus,  are  not  "novelty 
detectors"  in  the  sense  of  cells  that  respond  to  any  novel  stimulus.  Rather, 
IT  cells  combine  sensitivity  to  novelty  and  familiarity  with  sensitivity  to 
other object features such as shape and color. Furthermore, as shown in
figure  4.2,  these  same  cells  show  suppression  to  matching  stimuli  within  a 
trial,  which  is  added  to  the  suppression  caused  by  familiarity.  Thus,  these 
cells  appear  to  communicate  both  types  of  information — both  recency  and 
familiarity — and  the  effects  of  memory  appear  to  be  additive.  As  in  the 
short-term  memory  experiment,  at  least  half  the  cells  convey  sensory  in¬ 
formation  only  and  are  unaffected  by  the  contents  of  memory. 
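
The additivity of the two effects can be written down as a minimal Python sketch. Everything numerical here is an assumption for illustration: the linear decline with a cap, the parameter names (`familiarity_step`, `familiarity_cap`, `match_suppression`), and the firing rates are invented, and the source states only that the response declines with familiarity and that within-trial matching adds further suppression.

```python
def response(drive, prior_presentations, is_match,
             familiarity_step=0.02, familiarity_cap=0.4, match_suppression=0.2):
    """Sensory drive reduced by two additive terms: familiarity and recency."""
    # Familiarity: grows with the number of earlier presentations, up to a cap (assumed form).
    familiarity = min(familiarity_cap, familiarity_step * prior_presentations)
    # Recency: applies only when the stimulus matches the sample held within the trial.
    recency = match_suppression if is_match else 0.0
    return drive * max(0.0, 1.0 - familiarity - recency)


drive = 20.0                                    # hypothetical response to a novel sample
novel_sample = response(drive, 0, False)
novel_match = response(drive, 0, True)
familiar_sample = response(drive, 10, False)
familiar_match = response(drive, 10, True)

# Additivity: the match effect has the same size for novel and familiar stimuli.
match_effect_novel = novel_sample - novel_match
match_effect_familiar = familiar_sample - familiar_match
```

In this sketch the reduction caused by matching is the same whether the stimulus is novel or familiar, which is what additive effects of recency and familiarity would predict.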

To  assess  the  possible  role  of  feedback,  we  measure  the  time  course  of 
the  response  suppression  to  familiar  stimuli  in  the  population  of  cells.  We 
compare  the  average  population  response  to  novel  sample  stimuli  and  to 
the  same  stimuli  after  they  had  been  seen  just  once  before.  The  response 
waveform  to  the  stimuli  seen  once  before  does  not  show  suppression  until 
170-180  msec  after  stimulus  onset.  This  is  clearly  enough  time  for  the 
suppression  to  be  mediated  by  feedback  to  IT.  However,  after  the  stimuli 
had  been  seen  as  samples  just  one  additional  time,  suppression  is  evident 
from  the  very  onset  of  the  visual  response,  80  msec  after  stimulus  onset. 
Thus,  after  the  first  couple  of  presentations  of  a  stimulus,  IT  networks 
appear  to  detect  familiar  stimuli  on  their  own. 

Because  fewer  IT  cells  respond  strongly  to  a  stimulus  after  it  has  be¬ 
come  familiar,  familiarity  may  cause  a  "focusing,"  or  narrowing,  of  activ¬ 
ity  across  IT  cell  populations,  resulting  in  a  sparser  representation  of  the 
stimulus  (see  chapter  1  by  Barlow  and  chapter  8  by  Poggio  and  Hurlbert). 
Cells  that  poorly  represent  the  important  features  of  a  given  familiar  stim¬ 
ulus  may  be  winnowed  out  of  the  population  of  strongly  activated  cells 
through  the  adaptive  filtering  mechanism,  in  much  the  same  way  that  an 
excess  of  cells  and  connections  are  pruned  during  development.  If  the 
remaining  activated  cells  have  the  appropriate  association  with  cells  coding
contextual  information  (the  circumstances  in  which  the  stimuli  were
seen,  for  example),  they  might  mediate,  in  part,  an  explicit  memory  of  the
familiar  stimulus.

Figure  4.2  Average  responses  of  IT  neurons  whose  responses  declined  significantly  with
increasing  stimulus  familiarity  over  the  recording  session.  The  curves  show  the  average
response  to  20  initially  novel  stimuli  tested  for  each  cell  (each  cell  was  tested  with  a  new
set).  The  solid  line  indicates  the  responses  to  the  stimuli  when  they  were  samples,  the  dotted
line  indicates  responses  to  the  same  stimuli  when  they  were  matching  test  items  at  the  end
of  each  trial,  and  the  dashed  line  gives  the  baseline  (prestimulus)  firing  rate.  Trial  number
is  measured  from  the  first  trial  of  a  given  stimulus.  The  staircase  appearance  results  from
two  different  numbers  of  intervening  trials  between  successive  presentations  of  a  given  novel
stimulus  as  the  sample.  Three  or  35  trials,  in  alternation,  intervened  before  a  given  stimulus
was  repeated  as  the  sample  on  another  trial.  Responses  declined  most  when  only  3  trials
intervened  but  the  decrement  was  retained  ("remembered")  even  when  35  trials  intervened.
The  greater  decrement  after  3  trials  demonstrates  that  the  decrement  is  stimulus-specific
and  cannot  be  explained  by  simple  fatigue.  If  the  cells  were  simply  becoming  fatigued,  the
decrement  should  have  been  greater  after  35  intervening  trials,  as  the  cells  were  stimulated
by  far  more  stimuli  during  that  interval  than  when  only  three  trials  intervened.  (Adapted
from  Li  et  al.  1993)

On  the  other  hand,  the  adaptive  filtering  mechanism  might  mediate 
priming  phenomena,  which  reflect  implicit  memory.  In  a  typical  priming 
task  for  visual  patterns,  subjects  are  first  shown  a  list  of  drawings,  without 
any  instruction  to  remember  them.  Later,  they  are  given  a  picture  recogni¬ 
tion  task,  and  their  performance  is  usually  faster  or  better  for  the  stimuli 
that  had  been  seen  before,  even  if  they  have  no  conscious  memory  of  hav¬ 
ing  seen  them  (Schacter  et  al.  1990,  1991;  Squire  1992).  It  is  commonly 
believed  that  priming  is  due  to  a  tendency  of  neuronal  populations  to  be 
more  easily  activated  if  they  have  been  activated  previously  (see  chapter  9 
by  Posner  and  Rothbart).  For  individual  cells,  we  have  shown  that  this  is
not  the  case,  at  least  in  IT  cortex.  However,  as  we  suggest  above,  it  is  the
elimination  of  certain  cells  from  the  activated  population  that  may  be  im¬ 
portant  in  forming  the  underlying  neuronal  representation  of  a  stimulus. 
Repetition  may  speed  the  construction  of  this  critical  population,  resulting 
in  faster  and  better  recognition  when  a  stimulus  is  repeated.  Results  from 
a  recent  PET  study  of  priming  are  consistent  with  the  idea  that  priming  is 
associated  with  a  reduction  of  activity  in  the  cortex.  Subjects  performing 
a  word-stem  completion  task  showed  less  activation  of  temporal  cortex 
when  they  had  previously  seen  the  words  (Squire  et  al.  1992). 

ADAPTIVE  FILTERING  CELLS  AND  BEHAVIOR 

Could  the  responses  of  IT  cells  actually  support  the  animal's  behavioral 
performance  in  recency  memory  tasks  or  in  tasks  requiring  judgments  of 
novelty  and  familiarity?  We  have  used  both  simulated  neural  networks 
and  statistical  models  (discriminant  analysis)  to  classify  stimuli  as  match¬ 
ing  or  nonmatching  based  on  the  trial-by-trial  responses  of  individual  IT 
neurons  (Miller  et  al.  1993).  The  networks  or  models  are  trained  on  half 
the  data  and  then  applied  to  the  other  half,  for  cross-validation.  Although 
no  individual  cell  performs  as  well  as  the  animal  as  a  whole,  we  find  that 
we  can,  in  principle,  achieve  the  animal's  behavioral  level  of  performance 
by  averaging  the  responses  of  only  25  neurons.  We  would  expect  to  ob¬ 
tain  similar  results  from  the  novelty-familiarity  data.  Thus,  mnemonic 
information  equivalent  to  that  available  to  the  animal  as  a  whole  is  apparently  distributed
down  to  the  level  of  small  neural  populations.  Comparable  results  have 
been  found  in  area  MT  for  motion  information  (Britten  et  al.  1992). 
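The pooling argument above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the analysis of Miller et al. (1993): the firing rates, the noise level, the independence of noise across cells, and the function names are all hypothetical. A midpoint threshold stands in for a two-class discriminant with equal variances, fitted on half of the trials and tested on the held-out half.

```python
import random
import statistics


def simulate_trials(n_trials, n_cells, match, rng):
    """Return per-trial lists of firing rates (Hz) for a pool of cells.

    Assumed toy statistics: nonmatch trials average 20 Hz, match trials
    average 16 Hz (suppressed), with 8 Hz independent noise per cell.
    """
    mean = 16.0 if match else 20.0
    return [[rng.gauss(mean, 8.0) for _ in range(n_cells)]
            for _ in range(n_trials)]


def pooled(trials):
    """Average the responses of all cells in the pool on each trial."""
    return [statistics.fmean(trial) for trial in trials]


def fit_threshold(match_train, nonmatch_train):
    """Midpoint between the two class means of the pooled response."""
    return (statistics.fmean(match_train)
            + statistics.fmean(nonmatch_train)) / 2.0


def accuracy(n_cells, n_trials=400, seed=0):
    """Cross-validated fraction of trials classified correctly."""
    rng = random.Random(seed)
    match = pooled(simulate_trials(n_trials, n_cells, True, rng))
    nonmatch = pooled(simulate_trials(n_trials, n_cells, False, rng))
    half = n_trials // 2
    # Fit on the first half of the trials, test on the other half.
    threshold = fit_threshold(match[:half], nonmatch[:half])
    hits = sum(r < threshold for r in match[half:])
    correct_rejections = sum(r >= threshold for r in nonmatch[half:])
    return (hits + correct_rejections) / (2 * (n_trials - half))


if __name__ == "__main__":
    for n in (1, 5, 25):
        print(n, "cells:", round(accuracy(n), 3))
```

In this toy setting a single cell classifies only modestly above chance, whereas the average of 25 cells is far more reliable. Correlated noise between real neurons would reduce the benefit of pooling, so the numbers should not be read as estimates for IT cortex.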

The  classification  model  we  used  assumes  that  the  decision  networks 
which  interpret  the  outputs  of  IT  cells  "know"  (or  have  been  trained  on) 
the  match-nonmatch  response  distributions  of  the  cells  for  specific  stimuli. 
One  can  avoid  this  assumption  if  the  decision  network  is  supplied  with 
information  from  both  the  adaptive  cells  in  IT  and  the  cells  that
provide  purely  sensory  information.  The  responses  of  these  latter  cells, 
which  make  up  about  50%  of  the  total  population  of  cells  in  IT  cortex,  would 
provide  a  stable  sensory  "referent"  for  comparison  with  responses  of  the 
adaptive  cells.  As  shown  in  figure  4.3,  the  difference  in  response  of  the  two 
populations  would  be  proportional  to  the  similarity  of  the  current  stimulus 
to  stimuli  held  in  either  short-  or  long-term  memory.  If  the  responses 
of  the  minority  of  cells  showing  match  enhancement  were  added  to  the 
sensory  pool,  the  magnitude  of  the  response  difference  would  be  even 
larger,  resulting  in  a  better  signal-to-noise  ratio. 
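As a rough illustration of this referent scheme, the sketch below computes the difference between a sensory pool and an adaptive pool. The multiplicative form of the suppression and enhancement, the parameter values, and the function names are assumptions made for illustration and are not taken from the recordings.

```python
def adaptive_response(sensory_drive, similarity, suppression=0.5):
    """Adaptive cell: the sensory drive is scaled down in proportion to the
    similarity (0 to 1) between the current stimulus and a stored one."""
    return sensory_drive * (1.0 - suppression * similarity)


def mismatch_readout(sensory_drive, similarity, suppression=0.5,
                     enhancement=0.0):
    """Difference between the sensory referent pool and the adaptive pool.

    If enhancement > 0, cells showing match enhancement are treated as part
    of the referent pool (an assumption), which widens the difference.
    """
    referent = sensory_drive * (1.0 + enhancement * similarity)
    adaptive = adaptive_response(sensory_drive, similarity, suppression)
    return referent - adaptive


if __name__ == "__main__":
    for similarity in (0.0, 0.5, 1.0):
        print(similarity,
              mismatch_readout(20.0, similarity),
              mismatch_readout(20.0, similarity, enhancement=0.2))
```

With these assumptions the readout is zero for a stimulus unlike anything in memory and grows linearly with similarity, and the decision network needs no stimulus-specific knowledge beyond the two pooled signals.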

Beyond  its  role  in  memory  tasks,  we  believe  that  the  activation  level 
of  adaptive  cells  in  IT  cortex  (or  the  mismatch  signal  between  the  adap¬ 
tive  cells  and  the  sensory  cells)  may  be  an  important  drive  on  attentional 
and  orienting  systems.

Figure  4.3  Model  of  how  sensory  cells  and  adaptive  mnemonic  filter  cells  in  IT  cortex  contribute
to  novelty  or  match-nonmatch  decisions.  The  plus  and  minus  are  meant  to  indicate
that  the  outputs  of  the  adaptive  and  sensory  cells  work  in  opposition,  and  are  not  meant  to
imply  excitation  or  inhibition.  (Adapted  from  Miller  et  al.  1993)

Although  no  individual  cells  in  IT  cortex  appear
to  be  either  "novelty  detectors"  or  "unexpected  stimulus  detectors,"  the
summed  activity  of  adaptive  cells  in  IT  cortex  could  provide  a  signal  to
other  systems  that  the  current  stimulus  is  new  and  deserving  of  attention.
Conversely,  it  could  be  regarded  as  a  mechanism  for  discounting  the  fa¬ 
miliar  and  the  expected.  As  Barlow  points  out  in  chapter  1,  discounting 
the  expected  features  of  the  incoming  sensory  messages  seems  to  be  an  es¬ 
sential  element  of  effective  learning.  As  the  organism  orients  and  attends 
to  a  new  stimulus,  activated  IT  cell  populations  shrink  to  the  critical  set 
necessary  for  representing  the  stimulus.  This  shrinkage  reduces  the  overall 
activity  in  IT  cortex,  reducing  the  drive  on  the  orienting  system  and  freeing 
the  organism's  attention  for  other,  competing,  stimuli.  Thus,  it  is  behavior 
that  completes  the  loop.  One  could  view  this  as  a  memory-guided  system 
for  self-directed  behavior,  whose  goal  is  to  incorporate  knowledge  about 
new  stimuli  into  the  structure  of  the  cortex  (see  figure  4.4).  Carpenter 
and  Grossberg's  (1987)  Adaptive  Resonance  Theory  (ART)  makes  use  of 
a  similar  sort  of  feedback  between  memory  and  attention  as  a  means  for 
adaptively  adjusting  the  categories  in  which  items  are  stored  in  memory. 
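The loop described above can be caricatured in a few lines. This is a toy sketch, not a model proposed in the chapter: the update rule, the parameter values, and the names (`critical_level`, `release_threshold`, and so on) are hypothetical. Activation of the adaptive cells drives attention, attention drives adaptation, and adaptation shrinks the activation toward the level of an assumed "critical set" until the drive on the orienting system falls below a release threshold.

```python
def attend_until_familiar(activation=1.0, critical_level=0.1,
                          learning_rate=0.3, release_threshold=0.2,
                          max_steps=200):
    """Return the IT activation after each step of attending to a stimulus.

    activation        summed activity of adaptive cells (arbitrary units)
    critical_level    activity of the cells that remain after adaptation
    release_threshold activity below which attention is released
    """
    history = [activation]
    while activation > release_threshold and len(history) <= max_steps:
        attention = activation  # orienting drive follows IT activation
        # Attended stimuli cause adaptation, shrinking the active population
        # toward the critical set.
        activation -= learning_rate * attention * (activation - critical_level)
        history.append(activation)
    return history


if __name__ == "__main__":
    trace = attend_until_familiar()
    print(len(trace) - 1, "steps; final activation", round(trace[-1], 3))
```

The point of the sketch is only that the loop is self-terminating: the more novel the stimulus, the stronger the initial drive and the faster the early adaptation, and attention is freed once the stimulus has become familiar.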

The  adaptive  filtering  mechanism  suggests  that  attention  will  be  auto¬ 
matically  biased  toward  stimuli  that  are  novel  or  have  not  been  recently 
seen.  However,  it  is  frequently  necessary  to  voluntarily  attend  to  familiar
or  expected  stimuli  and  even  suppress  attention  to  novel  stimuli  (e.g.,  when
looking  for  a  particular  item  or  when  trying  to  "pay  attention").  How  this
might  be  accomplished  neurally  is  considered  in  the  next  section.

Figure  4.4  Interaction  between  systems  for  memory  and  attention.  In  this  scheme,  a  stimulus
that  is  novel  or  has  not  been  recently  seen  will  activate  adaptive  memory  filter  cells  in  IT
cortex,  which  in  turn  will  drive  attentional  and  orienting  systems.  This  will  lead  to  increased
attention  and  contact  with  the  new  stimulus,  causing  adaptation  of  synaptic  weights  in  IT
cortex,  reducing  the  activation  of  the  cells.  When  the  novelty  of  the  new  stimulus  and  the
activation  of  IT  cortex  are  sufficiently  diminished,  the  system  will  be  ready  to  process  other
competing  stimuli.  (Adapted  from  Li  et  al.  1993)

ACTIVE  MEMORY  MECHANISMS 

It  is  commonly  assumed  that  the  standard  matching  to  sample  task  that  we 
and  others  have  used  is  solved  by  holding  the  sample,  A,  "in  mind"  (i.e.,
in  some  short  term  storage  buffer)  during  the  delay  and  comparing  each 
of  the  test  items  to  it.  That  is,  the  memory  mechanism  is  thought  to  require 
an  active  process,  or  "working  memory,"  linked  specifically  to  the  sample 
stimulus,  much  the  way  one  might  rehearse  a  new  phone  number.  How¬ 
ever,  in  the  standard  form  of  the  task,  the  sample  stimulus  (e.g.,  "A")  is  the 
only  stimulus  repeated  within  a  given  trial  (e.g.,  "A  B  C  D  A").  Conceivably, 
then,  the  task  might  as  well  be  solved  by  a  neural  mechanism  that  auto¬ 
matically  detects  any  stimulus  repetition  (e.g.,  "A  A"),  without  requiring 
active  maintenance  of  the  sample  stimulus  memory.  Does  the  adaptive  fil¬ 
tering  mechanism  function  automatically,  or  does  it  require  active  storage 
of  the  sample  item?  To  distinguish  between  these  possibilities,  we  record 
from  IT  neurons  in  two  versions  of  the  matching  to  sample  task  (Miller  and 
Desimone  1994).  One  version  is  the  same  as  used  previously,  which  may 
be  solved  by  detecting  any  repetition  within  a  trial  ("A  B  C  A").  In  the  other 
version,  intervening  nonmatch  items  during  the  delay  period  may  match 
each  other,  but  the  animal  must  ignore  these  and  respond  only  to  the  one 
item  that  matches  the  sample  (e.g.,  "A  B  B  A,"  where  only  A  is  a  match). 
We  will  refer  to  these  as  the  standard  and  ABBA  versions,  respectively. 

Interestingly,  animals  initially  taught  the  standard  task  respond  to  the 
repeated  nonmatch  stimuli  when  first  presented  with  ABBA  trials  (i.e.,  they 
respond  to  the  second  occurrence  of  "B").  The  animals  therefore  apparently
learn  to  solve  the  task  using  a  simple  repetition  rule.  After  additional  train¬ 
ing,  animals  eventually  learn  to  match  items  to  only  the  sample  stimulus. 

After  the  monkeys  have  learned  the  ABBA  task,  we  still  find  adaptive 
filtering  cells  whose  responses  are  suppressed  to  the  matching  stimulus. 
However,  the  ABBA  trials  reveal  that  the  responses  of  these  cells  are  also 
suppressed  to  the  intervening  stimuli  that  match  each  other.  Thus,  the 
adaptive  filtering  mechanism  seems  to  be  sensitive  to  simple  repetition.  It 
should  be  noted  that  human  observers  easily  notice  the  repeated  interven¬ 
ing  stimuli  when  performing  the  ABBA  task,  at  the  same  time  that  they  are 
able  to  detect  the  one  stimulus  that  matches  the  sample. 

By  contrast,  another  class  of  IT  cells  gives  enhanced  responses  to  match¬ 
ing  stimuli  (which  was  rarely  found  prior  to  the  training  on  the  ABBA  trials) 
but  only  to  the  one  stimulus  that  matches  the  sample  item  on  a  given  trial, 
that  is,  the  one  stimulus  the  animal  is  actively  holding  in  memory.  Simple 
stimulus  repetition  has  no  effect  on  these  cells,  as  repeated  intervening 
items  show  no  enhancement.  These  cells  apparently  can  be  dynamically
"biased"  to  respond  to  a  specific  stimulus,  and  this  bias  can  span  many  seconds
and  at  least  several  intervening  stimuli.  Such  a  bias  is  presumably  mediated
by  "back  projections"  to  IT  cortex  (see  chapter  2  by  Churchland,  Ramachandran,
and  Sejnowski  and  chapter  12  by  Ullman),  possibly  from  prefrontal
cortex.  The  fact  that  the  bias  is  apparently  under  the  animal's  control
suggests  a  relation  to  the  active  components  of  "working  memory."  In  this 
case,  IT  cortex  appears  to  be  the  site  of  dual  short-term  memory  mech¬ 
anisms  operating  in  parallel:  a  suppressive  mechanism  underlying  the 
automatic  detection  of  repetition  and  an  enhancement  mechanism  under¬ 
lying  working  memory.  In  the  next  section,  we  consider  how  these  different 
memory  mechanisms  function  when  viewing  a  typical  crowded  scene. 
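The contrast between the two mechanisms on an ABBA trial can be made concrete with a short sketch. The response rule, the scaling factors, and the function names below are illustrative assumptions: stimulus selectivity is ignored, and every item is given the same unadapted rate.

```python
def cell_responses(trial, rate=20.0, suppress=0.6, enhance=1.4):
    """Return (item, adaptive_rate, enhancement_rate) for each item.

    Adaptive filtering cells are suppressed by any repetition within the
    trial; enhancement cells respond more only to a repeat of the sample.
    """
    sample = trial[0]
    seen = set()
    out = []
    for item in trial:
        repeated = item in seen
        adaptive = rate * (suppress if repeated else 1.0)
        enhanced = rate * (enhance if repeated and item == sample else 1.0)
        out.append((item, adaptive, enhanced))
        seen.add(item)
    return out


def repetition_rule(trial):
    """Index of the first item that repeats anything seen earlier."""
    seen = set()
    for index, item in enumerate(trial):
        if item in seen:
            return index
        seen.add(item)
    return None


def sample_match_rule(trial):
    """Index of the first later item that matches the sample."""
    for index, item in enumerate(trial[1:], start=1):
        if item == trial[0]:
            return index
    return None


if __name__ == "__main__":
    for row in cell_responses("ABBA"):
        print(row)
    print(repetition_rule("ABBA"), sample_match_rule("ABBA"))  # prints: 2 3
```

On a standard trial such as "ABCA" the two rules pick the same item, which is why the standard task cannot distinguish them. On "ABBA" the repetition rule selects the second "B", as the animals initially did, whereas the sample-match rule selects the final "A".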

VISUAL  SEARCH 

In  visual  search  for  a  particular  object,  the  representation  of  the  object  in 
memory  is  used  to  guide  the  search  of  the  external  scene  (e.g.,  searching 
for  a  face  in  a  crowd).  This  type  of  search  is  often  distinguished  from 
"preattentive"  or  "pop-out"  tasks,  where  a  subject  finds  a  stimulus  that 
stands  out  from  its  background  on  the  basis  of  a  strong  featural  cue  such  as 
color  or  luminance.  Most  agree  that  such  stimuli  are  extracted  from  their 
background  based  on  a  parallel  process  operating  over  the  full  display. 
However,  there  are  two  competing  notions  of  how  search  of  the  former 
sort  takes  place.  We  will  describe  the  two  extremes,  but  hybrid  models 
are  possible.  According  to  the  serial  search  account,  the  scene  is  searched 
element  by  element  by  a  "spotlight"  of  attention  (Bergen  and  Julesz  1983; 
Treisman  1988).  The  element  selected  by  attention  is  evaluated  by  a  recog¬ 
nition  memory  process,  which  is  terminated  when  the  target  is  found.  As 
elements  are  added  to  the  scene,  it  takes  longer  and  longer  to  find  the  tar¬ 
get.  By  contrast,  according  to  parallel  search  accounts,  all  elements  of  the 
scene  are  processed  in  parallel  and  compete  for  access  to  decisional  mechanisms,
attentional  mechanisms,  and  so  on  (Duncan  and  Humphreys  1989;
Bundesen  1992).  The  mnemonic  template  of  the  searched-for  object  biases 
the  competition  in  favor  of  the  neurons  coding  that  particular  object,  much 
the  same  way  that  a  strong  color  difference  may  bias  the  competition  to¬ 
wards  a  unique  element  in  a  "pop-out"  display.  As  elements  are  added 
to  the  scene,  it  takes  longer  to  find  the  target  because  of  the  reduction  in 
signal-to-noise  ratio.  Attentional  scrutiny  may  follow  the  localization  of 
the  target  according  to  this  view,  but  is  not  essential  to  find  it.  We  have 
recently  begun  to  investigate  the  neural  basis  of  this  type  of  visual  search 
in  anterior  IT  cortex  (Chelazzi  et  al.  1993).  Although  we  are  not  yet  able  to 
provide  a  full  account  of  the  mechanism  of  search,  the  results  show  how 
memory  and  attention  interact  in  IT  cortex. 

Monkeys  are  presented  with  a  complex  picture  (the  cue)  at  the  center  of 
gaze  to  hold  in  memory.  The  cue  is  always  either  a  "good"  stimulus  that 
elicits  a  strong  response  from  the  cell  or  a  "poor"  stimulus  that  elicits  little 
or  no  response.  Both  good  and  poor  cues  are  highly  familiar  to  the  monkey. 
After  a  delay,  the  good  stimulus  and  the  poor  stimulus  are  both  presented 
simultaneously  as  choice  stimuli,  at  extrafoveal  locations  in  the  contralat¬ 
eral  visual  field.  Because  of  the  large  receptive  fields  of  IT  neurons,  the  two 
choice  stimulus  locations  are  typically  both  within  the  receptive  field.  The 
animal  is  trained  to  make  a  saccadic  eye  movement  to  the  target  stimulus 
that  matches  the  cue,  ignoring  the  nonmatching  stimulus  (the  "distractor"). 
Thus,  unlike  in  our  previous  short-term  memory  experiments,  the  animal 
does  not  have  to  indicate  whether  a  matching  stimulus  was  present  (one 
was  always  present  in  the  choice  array),  but  rather  must  find  the  matching 
stimulus,  or  separate  it  from  the  nonmatching  stimulus  in  the  array.  Any 
suppression  of  response  to  the  matching  stimulus  in  this  task  might  place 
it  at  a  competitive  disadvantage  to  the  nonmatching  stimulus,  possibly 
interfering  with  its  ability  to  capture  attention.  A  further  difference  is  that 
the  cue  and  the  matching  choice  stimulus  are  never  presented  at  the  same 
retinal  location,  that  is,  the  cue  and  matching  target  never  activate  the  same 
retinal  elements.  It  is  possible,  therefore,  that  the  animal  must  compare  the 
choice  stimuli  with  a  more  abstract  representation  of  the  cue. 

We  find  that  the  cue  typically  initiates  activity  that  persists  through  the 
following  delay  period  among  the  neurons  that  are  tuned  to  the  cue's 
features.  Although  the  frequency  of  this  maintained  activity  is  relatively 
low  (average  =  7.9  Hz  following  best  cue  and  5.6  Hz  following  worst  cue),
it  is  much  more  common  than  in  our  previous  memory  experiments  and 
lasts  for  a  considerable  time  (at  least  3  sec,  the  maximum  delay  tested). 

In  addition  to  information  about  the  cue,  IT  cells  communicate  informa¬ 
tion  about  the  target.  Relative  to  the  preceding  delay  activity,  the  initial 
population  response  to  the  array  is  about  the  same  regardless  of  which 
choice  stimulus  is  the  target.  By  contrast,  the  late  phase  of  the  response 
changes  dramatically  depending  on  whether  the  animal  is  about  to  make 
an  eye  movement  to  the  good  or  poor  stimulus.

Figure  4.5  Average  firing  rates  of  IT  cells  to  choice  arrays  in  a  visual  search  task,  in  which
either  the  good  (solid  lines)  or  poor  (dashed  lines)  stimulus  was  the  target.  (a)  Responses
time-locked  to  array  onset.  Average  time  of  saccade  onset,  306  msec  after  array  onset,  is
indicated  by  an  asterisk.  (b)  Same  data  as  in  (a),  but  responses  are  time-locked  to  saccade
onset.  Data  in  all  graphs  are  from  trials  in  which  the  target  and  distractor  appeared  in  the
hemifield  contralateral  to  the  recorded  cell,  whether  the  target  was  in  the  upper  quadrant
and  the  distractor  in  the  lower,  or  vice  versa.  (Adapted  from  Chelazzi  et  al.  1993)

If  the  target  is  the  good
stimulus,  the  response  remains  high,  but  if  the  target  is  the  poor  stimulus,
the  response  to  the  good  distractor  stimulus  is  suppressed  even  though  it 
is  still  within  the  receptive  field  (figure  4.5).  This  suppression  of  response 
begins  about  200  msec  after  the  onset  of  the  choice  array,  or  about  90-120 
msec  before  the  start  of  the  eye  movement.  The  cells  respond  as  though 
100  msec  or  so  before  the  eye  movement,  the  target  stimulus  "captures"  the 
response  of  the  cells,  so  that  neuronal  activity  in  IT  reflects  only  the
target's  properties.  This  target  information  in  IT  cortex  is  available  to  drive 
attentional  as  well  as  oculomotor  systems,  resulting  in  eye  movements  to 
the  chosen  object  (Glimcher  and  Sparks  1992).  More  generally,  networks 
in  IT  cortex  may  select  the  visual  objects  that  are  acted  on  by  attentional 
and  motor  systems  when  selection  is  guided  by  object  features. 

These  effects  are  found  when  target  and  distractor  are  both  within  the 
hemifield  contralateral  to  the  recorded  neuron.  By  contrast,  there  is  little 
effect  on  responses  prior  to  the  eye  movement  when  the  stimuli  are  sepa¬ 
rated  by  the  vertical  meridian.  Moran  and  Desimone  (unpublished  data) 
and  Sato  (1988)  also  found  reduced  effects  of  spatially  directed  attention 
in  IT  cortex  when  attended  and  ignored  stimuli  were  located  in  opposite 
hemifields,  suggesting  that  stimuli  in  the  two  hemifields  may  be  processed 
largely  independently.  Although  IT  neurons  often  have  bilateral  fields,  the 
response  to  stimuli  in  the  contralateral  hemifield  is  typically  much  larger 
than  to  stimuli  in  the  ipsilateral  hemifield.  When  both  choice  stimuli  are  in 
the  contralateral  hemifield,  a  further  surprising  result  is  that  many  IT  cells 
respond  differently  depending  on  the  relative  spatial  locations  of  target 
and  distractor.  Thus,  some  spatial  (retinal)  information  appears  to  be  retained
in  IT  cortex,  despite  the  large  receptive  fields.  Retinal  location  may
simply  be  another  coarsely  coded  feature  in  IT  cortex  and  thus  be  available
to  oculomotor  systems  for  directing  the  eyes.

Figure  4.6  Upper  diagrams  are  schematic  representations  of  activity  in  a  population  of  IT
neurons  during  performance  of  the  task.  Each  dot  represents  an  individual  cell,  and  the  size
of  the  dot  indicates  relative  firing  rate.  Lower  diagrams  illustrate  the  visual  displays  during
the  relevant  portions  of  the  task.  A  specific  cue  activates  the  subpopulation  of  IT  cells  tuned
to  any  of  the  various  features  of  the  cue.  During  the  delay  period,  this  subpopulation  remains
more  active  than  other  cells.  When  the  choice  array  first  appears,  cells  are  initially  activated  by
whichever  stimulus  they  prefer  in  the  array,  regardless  of  which  is  the  target.  In  some  cases  the
initial  responses  are  identical  but  in  others  the  response  to  the  array  with  the  good  target  starts
off  at  a  higher  rate  because  of  persisting  activity  from  the  delay.  Later,  within  90-120  msec  of
saccade  onset,  the  cells  tuned  to  the  properties  of  the  target  stimulus  remain  active,  whereas
cells  tuned  to  the  properties  of  the  distractor  are  suppressed.  Whether  this  final  divergence
in  activation  results  from  competitive  interactions  within  IT  cortex  (e.g.,  through  mutual
inhibition  between  cells  selective  for  target  and  distractor)  or  from  interactions  between  IT
cortex  and  an  attentional  control  system  is  not  yet  known.  (Adapted  from  Chelazzi  et  al.  1993)

It  seems  as  though  we  have  narrowed  the  selection  component  of  vi¬ 
sual  search  to  a  200-msec  period  following  the  onset  of  a  peripheral  search 
array.  By  the  end  of  this  time,  neural  responses  begin  to  communicate 
almost  exclusively  the  properties  of  the  chosen  target.  Within  this  criti¬ 
cal  200-msec  period,  we  have  observed  the  same  mnemonic  phenomena 
we  observed  in  IT  cortex  in  the  memory  tasks  with  a  single  stimulus  in 
the  field,  namely  suppressed  responses  when  the  target  array  contains  a 
stimulus  that  matches  the  previously  seen  cue  (adaptive  filtering  cells), 
enhanced  responses  (for  other  cells)  when  the  good  stimulus  is  the  target, 
and  elevated  activity  in  the  delay  interval  when  the  good  stimulus  is  the 
cue/target.  Any  or  all  of  these  effects  on  responses  may  bias  competition
within  IT  networks  such  that  only  those  cells  coding  the  target  are
responding  by  the  end  of  200  msec  (figure  4.6). 

It  is  particularly  intriguing  that  the  maintained  activity  during  the  delay 
interval  is  much  more  pronounced  than  in  the  previous  memory  experi¬ 
ments  and,  in  some  conditions,  it  persists  into  the  initial  response  to  the 
choice  array.  If  the  monkey  is  given  the  good  stimulus  as  a  cue,  for  example,
the  cell  often  shows  higher  activity  during  the  delay,  increasing  the  initial 
response  to  the  array  with  the  good  stimulus  as  the  target.  This  heightened 
activity  may  give  those  cells  tuned  to  the  properties  of  the  target  a  compet¬ 
itive  advantage  within  IT,  causing  them  to  win  a  competition  against  the 
cells  tuned  to  the  properties  of  the  distractor.  Such  a  mechanism  would 
be  consistent  with  the  "parallel  search"  model  described  above.  Increas¬ 
ing  the  steady-state  firing  rate  of  a  population  of  cells  coding  a  particular 
stimulus  may  be  a  way  of  segregating,  or  binding,  those  cells  that  will  con¬ 
tribute  the  most  to  a  later  perceptual  decision.  Consistent  with  this  idea, 
we  have  recently  found  evidence  for  sustained  firing  of  V4  cells  when  at¬ 
tention  is  directed  to  the  center  of  the  cell's  receptive  field  (Luck  et  al.  1993). 
Alternatively,  the  higher  steady-state  activity  might  turn  out  to  be  simply  a 
reflection  of  greater  synchronous  firing  among  the  relevant  population  of 
neurons,  in  which  case  the  synchronicity  itself  could  be  the  binding  mech¬ 
anism,  not  the  maintained  activity  (see  chapter  5  by  Koch  and  Crick  and 
chapter  10  by  Singer). 
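One way to see how a small sustained bias could decide such a competition is a two-population rate model with mutual inhibition. The sketch below is purely illustrative: the equations, the parameter values, and the function name are assumptions and are not fitted to the recordings. Both populations receive the same stimulus drive from the array, and a small extra input stands in for the elevated delay activity of the cells coding the cue.

```python
def compete(target_bias, distractor_bias=0.0, stimulus_drive=1.0,
            inhibition=1.5, dt=0.1, steps=300):
    """Integrate two mutually inhibiting populations with Euler steps.

    Returns the final (target, distractor) activity in arbitrary units.
    Each population relaxes toward its rectified net input.
    """
    target = distractor = 0.0
    for _ in range(steps):
        target_input = max(
            0.0, stimulus_drive + target_bias - inhibition * distractor)
        distractor_input = max(
            0.0, stimulus_drive + distractor_bias - inhibition * target)
        target += dt * (target_input - target)
        distractor += dt * (distractor_input - distractor)
    return target, distractor


if __name__ == "__main__":
    print("bias to target:    ", compete(0.2))
    print("bias to distractor:", compete(0.0, 0.2))
    print("no bias:           ", compete(0.0))
```

Because the inhibition is stronger than the self-decay in this sketch, an initially small difference grows until the unbiased population is silenced, so whichever population carries the memory-related bias ends up "capturing" the response. With no bias the two populations remain tied, which is one way of stating why a template held in memory is needed to resolve the competition.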

Finally,  although  we  have  described  the  responses  of  IT  cells  as  though 
all  of  the  critical  interactions  take  place  within  IT  cortex  itself,  it  is  possible 
that  the  ultimate  suppression  of  IT  responses  to  distractors  is  due  to  inter¬ 
actions  with  attentional  systems  outside  of  IT  cortex.  A  spatial  attention 
system  could,  for  example,  gate  inputs  into  IT  from  one  stimulus  in  the 
array  at  a  time.  The  appropriate  modulation  of  IT  responses  by  the  joint 
action  of  the  selected  stimulus  and  the  stimulus  held  in  memory  would 
constitute  "recognition,"  terminating  the  scan.  Such  a  mechanism  would 
be  consistent  with  the  "serial  scan"  model  described  above.  Anatomical 
considerations  dictate  that  such  a  mechanism  could  operate  in  either  of  two 
ways  (Desimone  1992):  either  by  gating  the  inputs  into  IT  cells  (Anderson 
and  Van  Essen  1987)  or  by  gating  the  cells  themselves  on  and  off  (Crick  and 
Koch  1990b;  Niebur,  Koch,  and  Rosin  1993).  Different  approaches  to  atten¬ 
tional  gating  are  described  in  chapter  5  by  Koch  and  Crick  and  chapter  13 
by  Van  Essen,  Anderson,  and  Olshausen. 

CONCLUSIONS 

It  is  natural  to  think  of  attention  as  the  gateway  to  memory,  as  we  typically 
remember  only  those  things  that  we  attend  to.  Correspondingly,  ignored 
stimuli  are  filtered  from  the  receptive  fields  of  extrastriate  neurons.  In¬ 
formation  cannot  be  remembered  if  it  has  been  removed  from  the  visual 
system.  However,  as  we  have  seen,  the  contents  of  memory  also  guide  our 
attention.  Memory  for  objects  is  reflected  both  in  the  maintained  activity 
of  IT  cells  in  the  absence  of  any  stimuli  and  in  the  responses  of  cells
to  current  stimuli.  New  or  not-recently-seen  stimuli  cause  the  greatest  ac¬ 
tivation  of  adaptive  filtering  cells  in  IT  cortex,  and  the  difference  in  overall 
activity  between  these  cells  and  sensory  cells  unaffected  by  memory  may 
be  one  of  the  signals  that  drives  attentional  mechanisms.  This  temporal 
"figure-ground"  mechanism  most  likely  builds  up  incrementally  within
the  visual  cortex.  As  the  organism  attends  to  the  new  inputs,  activity  in 
the  adaptive  filtering  cells  declines,  reducing  the  drive  on  attentional  sys¬ 
tems.  Many  times,  though,  we  need  to  search  for  a  particular  familiar  object 
and  suppress  orienting  to  novel  ones.  In  this  case,  top-down  mechanisms 
are  able  to  bias  those  cells  that  code  the  searched-for  item,  resulting  in  en¬ 
hanced  activation  when  the  stimulus  occurs.  An  even  more  pronounced 
case  of  memory-guided  attention  occurs  in  visual  search,  where  the  rep¬ 
resentation  of  a  target  item  in  memory  is  used  to  guide  the  search  of  an 
external  scene  containing  many  stimuli.  Within  200  msec,  the  interaction 
between  the  neural  representation  of  the  external  array  and  the  memory 
trace  of  the  target  results  in  the  target  "capturing"  the  responses  of  IT  cells. 
Information  about  nontargets  is  almost  completely  suppressed.  Thus,  in¬ 
teractions  between  memory  and  attention  in  IT  cortex  result  in  the  selection 
of  objects  that  are  foveated  and  acted  on.


Some  Further  Ideas  Regarding  the  Neuronal
Basis  of  Awareness 

Christof  Koch  and  Francis  Crick 


INTRODUCTION 

What  goes  on  in  our  head  while  we  perceive  a  seagull  gliding  effortlessly 
through  the  air?  And  what  happens  if  we  drive  down  the  freeway,  con¬ 
centrating  on  an  upcoming  lecture  and  barely  aware  of  our  visual  sur¬ 
roundings?  While  our  "mind"  is  occupied  with  perceiving  the  seagull  or 
contemplating  some  future  event,  our  brain  continues  to  perform  a  series  of 
quite  complex  functions  such  as  climbing  over  rocks  on  a  beach  or  maneu¬ 
vering  a  car.  What  causes  one  event  to  capture  our  attention  (i.e.,  to  have 
access  to  a  privileged  internal  state  that  is  central  to  our  conscious
experience),  while  the  myriads  of  other  sensory  events  that  we  are  continuously
being  bombarded  with  never  reach  awareness?  Because  we  are  neurosci¬ 
entists,  we  are  couching  this  question  in  a  radical  reductionist  framework: 
What  is  the  correlate  of  awareness  (and  ultimately  consciousness)  at  the 
level  of  nerve  cells?  Expressed  differently:  What  telltale  signs  should  the 
electrophysiologist  look  for  when  searching  for  the  neuronal  correlate  of 
awareness?  A  particular  type  of  neuron  in  a  particular  part  of  the  brain? 
A  special  form  of  neuronal  electrical  activity? 

With  very  few  exceptions,  almost  no  modern  neuroscientist  has  asked
this  sort  of  question,  let  alone  provided  an  answer.  The  problem  of  "aware¬ 
ness"  is  either  felt  to  be  purely  philosophical  or  too  elusive  to  study  ex¬ 
perimentally.  In  our  opinion,  such  timidity  in  the  face  of  one  of  the  most 
puzzling  questions  that  we  can  ask  is  ridiculous!  Given  the  pre-Copernican
state  of  the  field,  it  is  too  early  for  any  definite  theory  of  the  neuronal  basis 
of  awareness.  Yet  we  believe  that  the  time  is  ripe  to  provide  at  least  a 
theoretical  framework  to  allow  us  to  seek  answers  using  the  well-proven 
tools  of  neuroscience,  in  particular  electrophysiology!

We  have  already  (Crick  and  Koch  1990a,b,  1992)  described  our  general 
approach  to  the  problem  of  visual  awareness.  In  brief,  we  believe  the  next 
important  step  is  to  find  experimentally  the  neural  correlates  of  various 
aspects  of  visual  awareness,  that  is,  how  best  to  explain  our  subjective 
mental  experience  in  terms  of  the  behavior  of  large  groups  of  nerve  cells.  At 
this  early  stage  in  our  investigation  we  will  not  worry  too  much  about  many 
fascinating  but  at  the  moment  unrewarding  aspects  of  the  problem,  such  as
the  exact  function  of  visual  awareness,  what  species  do  and  what  species
do  not  have  awareness,  different  forms  of  awareness  (such  as  dreams  and 
visual  imagination),  and  the  deep  problem  of  qualia.  We  here  restrict  our
attention  mainly  to  results  on  man  and  on  the  macaque  monkey,  since  their 
visual  systems  appear  to  be  somewhat  similar  and,  at  the  moment,  we 
cannot  obtain  all  the  information  we  need  from  either  of  them  separately. 
When  this  information  is  lacking  we  will  refer  to  related  results  on  the  cat. 

Our  main  assumption  is  that,  at  any  moment,  the  firing  of  some  but  not 
all  the  neurons  in  what  we  call  the  visual  cortical  system  (which  includes 
the  neocortex  and  the  hippocampus  as  well  as  a  number  of  directly  as¬ 
sociated  structures,  such  as  the  visual  parts  of  the  thalamus  and  possibly 
the  claustrum)  correlates  with  visual  awareness.  Yet,  visual  awareness  is 
highly  unlikely  to  be  caused  by  the  firing  of  all  neurons  in  this  system  that 
happen  to  respond  above  their  background  rate  at  any  particular  moment. 
If  at  any  given  point  in  time  only  1%  of  all  the  neurons  in  cortex  fire  signif¬ 
icantly,  about  one  billion  cells  in  sensory,  motor,  and  association  cortices 
would  be  active  and  we  would  never  be  able  to  distinguish  any  particular 
event  out  of  this  vast  sea  of  active  nerve  cells.  We  strongly  expect  that  the 
majority  of  neurons  will  be  involved  in  doing  computations,  while  only  a 
much  smaller  number  will  express  the  results  of  these  computations.  It  is 
probable  that  we  become  aware  of  only  the  latter.  There  is  already  preliminary
evidence  from  the  study  of  the  firing  of  neurons  during  binocular
rivalry  that  in  area  MT  of  the  macaque  monkey  only  a  fraction  of  neu¬ 
rons  follows  the  monkey's  percept  (Logothetis  and  Schall  1989).  We  can 
thus  usefully  ask  the  question:  What  are  the  essential  differences  between 
those  neurons  whose  firing  does  correlate  with  the  visual  percept  and  those 
whose  firing  does  not?  Are  the  "awareness"  neurons  of  any  particular  cell 
type?  Exactly  where  are  they  located,  how  are  they  connected,  and  is  there 
anything  special  about  their  patterns  of  firing? 

To  look  for  such  neurons  it  may  be  useful  to  have  some  tentative  ideas  as 
to  where  they  are  and  how  they  might  behave,  if  only  to  guide  experiments. 
The  following  suggestions  are  offered  in  that  spirit. 

In  our  previous  papers  (Crick  and  Koch  1990a,  1992)  we  outlined  the 
ideas  of  psychologists  that  consciousness  involves  a  form  of  short-lasting 
memory,  possibly  including  both  iconic  as  well  as  short-term  memory,  and 
that  it  appears  to  be  greatly  enriched  by  selective  attention.  Accordingly, 
we  need  to  search  for  a  mechanism  that  mediates  selective  visual  attention 
as  well  as  a  short  form  of  visual  memory. 

We  also  explained  that  there  does  not  appear  to  be  one  single  cortical 
area  whose  activity  corresponds  to  what  we  see.  The  necessary  visual 
information  is  distributed  over  many  cortical  areas.  Thus,  there  has  to 
be  a  process  that  in  some  way  binds  together  the  neural  activity  in  many 
different  places  to  form  our  unitary  view  of  an  object  or  event  in  the  visual 
scene.  How  this  is  done  is  often  referred  to  as  the  binding  problem. 

It  is  important  to  distinguish  at  least  three  different  forms  of  binding 
(Crick  and  Koch  1990a).  A  simple  cell,  responding  best  to  motion  perpendicular
to  its  optimal  orientation,  can  be  thought  to  exemplify  a  low-level
type  of  binding  (at  least  three  features,  spatial  location,  orientation  and
direction  of  motion  are  combined  in  one  such  cell).  The  second  type  of  binding
is  acquired  by  learning  a  small  class  of  ecologically  very  important  stimuli,
such  as  faces  (or  letters  for  humans).  Due  to  repeated  exposure  to  these
patterns,  a  set  of  neurons  has  become  uniquely  responsive  to  just  these 
stimuli.  Because  both  of  these  mechanisms  have  only  a  fairly  limited  ca¬ 
pacity,  a  third  type  of  very  rapid  and  transient  form  of  binding  with  a  prac¬ 
tically  infinite  capacity  is  proposed.  It  seems  probable  that  an  attentional 
mechanism  is  usually  necessary  for  this  rapid  and  transient  binding  to  occur.
This  rapid  form  of  binding  can,  by  overlearning,  eventually  be  carried
out  by  a  specialized  set  of  neurons  (i.e.,  by  the  second  form  of  binding). 

We  also  hypothesized  that  in  addition  to  vivid  visual  awareness  there 
may  be  an  extremely  transient  form  of  awareness  that  symbolizes  only 
rather  primitive  features  without  binding  them  together.  We  will  return  to 
this  in  the  last  section. 

OUR  THEORETICAL  FRAMEWORK 

It  may  be  useful  to  state  our  'fundamental'  assumptions  at  the  beginning. 
We  assume  the  following: 

1.  To  be  aware  of  an  object  or  an  event  the  brain  has  to  construct  an  ex¬ 
plicit,  multilevel,  symbolic  interpretation  of  part  of  the  visual  scene.  By  explicit 
we  mean  that  one  such  neuron  (or  a  few  closely  associated  ones)  must  be 
firing  above  background  at  that  particular  time  in  response  to  the  feature 
they  symbolize.  The  pattern  of  color  dots  on  a  TV  screen,  for  instance, 
contains  an  "implicit"  representation  of,  say,  a  person's  face,  but  only  the 
dots  and  their  locations  are  made  explicit  here;  an  explicit  face  representa¬ 
tion  would  correspond  to  a  light  that  is  wired  up  in  such  a  manner  that  it 
responds  whenever  a  face  appears  somewhere  on  the  TV  screen.  By  mul¬ 
tilevel  we  mean,  in  psychological  terms,  different  levels  such  as  those  that 
correspond,  for  example,  to  lines,  eyes,  or  faces.  In  neurological  terms  we 
mean,  loosely,  the  different  levels  in  the  visual  hierarchy  (see  Felleman  and 
Van  Essen  1991).  By  symbolic,  as  applied  to  a  neuron,  we  mean  that  the
neuron's  firing  is  strongly  correlated  with  some  "feature"  of  the  visual  world
and  thus  symbolizes  it  (this  use  of  the  word  "symbol"  should  not  be  taken 
to  imply  the  existence  of  a  homunculus  who  is  looking  at  the  symbol).  The 
meaning  of  such  a  symbol  depends  not  only  on  the  neuron's  receptive  field 
(i.e.,  what  visual  features  the  neuron  responds  to)  but  also  on  what  other 
neurons  it  projects  to  (its  projective  field).  Whether  a  neural  symbol  is  best 
thought  of  as  a  scalar  (one  neuron)  or  a  vector  (a  group  of  closely  associated 
neurons  as  in  population  coding  in  the  superior  colliculus;  Lee  et  al.  1988) 
is  a  difficult  question  that  we  shall  not  discuss  here. 

2.  Awareness  results  from  the  firing  of  a  coordinated  subset  of  cortical  (and 
possibly  thalamic)  neurons  that  fire  in  some  special  manner  for  a  certain
length  of  time,  probably  for  at  least  100  or  200  msec.  This  firing  needs  to
activate  some  type  of  short-term  memory  by  either  strengthening  certain 
synapses  or  maintaining  an  elevated  firing  rate  or  both.  Experimental 
studies  involving  short-term  memory  tasks  in  the  temporal  lobe  of  the 
monkey  (Fuster  and  Jervey  1981)  have  provided  evidence  of  elevated  firing
rates  for  the  duration  of  the  interval  during  which  an  item  needs  to  be 
remembered.  It  is  at  present  not  possible  to  assess  empirically  to  what 
extent  synapses  undergo  a  short-term  change  during  a  memory  task  in  the 
animal.  We  are  assuming  that  the  semiglobal  activity  that  corresponds  to 
awareness  has  to  last  for  some  minimum  time  (of  the  order  of  100  msec) 
and  that  events  within  that  time  window  are  treated  by  the  brain  as  approximately 
simultaneous.  An  example  would  be  the  flashing  for  20  msec  of  a  red  light 
followed  immediately  by  20  msec  of  a  green  light  in  the  same  position. 
The  observer  sees  a  transient  yellow  light  (corresponding  to  the  mixture 
of  red  and  green)  and  not  a  red  light  changing  into  a  green  light  (Efron  1973).
Other  psychophysical  evidence  shows  that  visual  stimuli  of  less  than  120- 
130  msec  produce  perceptions  having  a  subjective  duration  identical  to 
those  produced  by  stimuli  of  120-130  msec  (Efron  1970a,b). 

3.  Unless  a  neuron  has  an  elevated  firing  rate  and  unless  it  fires  as  a  member
of  such  a  (usually  temporary)  assembly,  its  firing  will  not  directly
symbolize  some  feature  of  awareness  at  that  moment. 

These  ideas,  taken  together,  place  restrictions  on  what  sort  of  changes 
can  reach  awareness.  An  example  would  be  the  awareness  of  movement 
in  the  visual  scene.  Both  physiological  and  psychophysical  studies  have 
shown  that  movement  is  extracted  early  in  the  visual  system  as  a  primitive 
(by  the  so-called  short-range  motion  system;  Braddick  1980).  We  can  be 
aware  that  something  has  moved  (but  not  what  has  moved)  because  there 
are  neurons  whose  firing  symbolizes  movement  as  such,  being  activated 
by  certain  changes  in  luminance.  To  know  what  has  moved  (as  opposed 
to  a  mere  change  of  luminance)  there  must  be  active  neurons  somewhere 
in  the  brain  that  symbolize,  by  their  firing,  that  there  has  been  a  change  of 
that  particular  character.  In  any  instance,  such  neurons  may  be  present  in 
the  brain  but  it  cannot  be  assumed  that  they  must  be  there. 

3.1  As  a  corollary,  we  formulate  our  activity  principle:  Underlying  every 
direct  perception  is  a  group  of  neurons  strongly  firing  in  response  to  that 
stimulus  that  come  to  symbolize  it.  An  example  is  the  "Kanizsa  triangle" 
illusion,  in  which  three  Pacmen  are  situated  at  the  corners  of  a  triangle,
with  their  open  mouths  facing  each  other.  Human  observers  see  a  white 
triangle  with  illusory  lines,  even  though  the  intensity  is  constant  between 
the  Pacmen.  As  reported  by  von  der  Heydt  et  al.  (1984),  cells  in  V2  of  the 
awake  monkey  strongly  respond  to  such  illusory  lines.  Another  case  is  the
filling-in  of  the  blind  spot  in  the  retina  (Fiorani  et  al.  1992).  Since  we  do 
not  have  neurons  that  explicitly  represent  the  blind  spot  and  events  within 
it,  we  are  not  aware  of  small  objects  whose  image  projects  onto  it  and
can  only  infer  such  objects  indirectly. 
A  semiglobal  activity  that  corresponds  to  awareness  does  not  itself  symbolize
a  change  within  that  short  period  of  awareness  unless  such  a  change
is  made  explicit  by  some  neurons  whose  firing  makes  up  the  semiglobal
activity  (for  what  else  but  another  group  of  neurons  can  express  the  notion 
that  a  change  has  occurred?).  These  ideas  are  very  counterintuitive  and  are 
not  easy  to  grasp  on  first  reading,  since  the  "fallacy  of  the  homunculus" 
slips  in  all  too  easily  if  one  does  not  watch  out  for  it. 

3.2  It  follows  that  active  neurons  in  the  cortical  system  that  do  not  take 
part  in  the  semiglobal  activity  at  the  moment  can  still  lead  to  behavioral 
changes  but  without  being  associated  with  awareness.  These  neurons  are 
responsible  for  the  large  class  of  phenomena  that  bypass  awareness  in 
normal  subjects,  such  as  automatic  processes,  priming,  subliminal  percep¬ 
tion,  learning  without  awareness,  and  others  (Tulving  and  Schacter  1990; 
Kihlstrom  1987)  or  take  part  in  the  computations  leading  up  to  awareness. 
In  fact,  we  suspect  that  the  majority  of  neurons  in  the  cortical  system  at 
any  given  time  are  not  directly  associated  with  awareness! 

The  elevated  firing  activity  of  these  neurons  also,  of  course,  explains
blindsight  and  similar  clinical  phenomena  where  patients  with  cortical  blindness
can  point  fairly  accurately  to  the  position  of  objects  in  their  blind  visual 
field  (or  detect  motion  or  color)  while  strenuously  denying  that  they  see 
anything  (Weiskrantz  1986;  Storig  and  Cowey  1991). 

We  have  argued  (from  the  experiments  on  binocular  rivalry)  that  the 
firing  of  some  cortical  neurons  does  not  correlate  with  the  percept.  It  is 
conceivable  that  all  cortical  neurons  may  be  capable  of  participating  in  the 
representation  of  one  percept  or  another,  though  not  necessarily  doing  so 
for  all  percepts.  The  secret  of  visual  awareness  would  then  be  the  type  of 
activity  of  a  temporary  subset  of  them,  consisting  of  all  those  cortical  neu¬ 
rons  that  represent  that  particular  percept  at  that  moment.  An  alternative 
hypothesis  is  that  there  are  special  sets  of  "awareness"  neurons  somewhere 
in  cortex  (for  instance,  layer  5  bursting  cells;  see  below).  Awareness  would 
then  result  from  the  activity  of  these  special  neurons. 

In  any  case,  it  is  crucial  to  ask  what  exactly  is  the  detailed  nature  of  the 
particular  type  of  neuronal  activity  giving  rise  to  awareness?  Does  it  in¬ 
volve  oscillatory  activity,  synchronized  firing,  or  bursts  of  high-frequency 
activity?  And  how  can  this  activity  be  decoded? 

THE  NEURONAL  SPIKE  CODE  OF  AWARENESS 

Earlier,  we  postulated  that  awareness  is  mediated  by  a  coordinated  subset 
of  cortical  (and  possibly  also  thalamic)  neurons  firing  in  some  special  man¬ 
ner  for  a  certain  length  of  time.  How  can  this  coordinated  subset  of  cells 
be  formed — and  disbanded — quickly?  Given  that  the  only  rapid  mode  of 
communication  among  cortical  cells  involves  action  potentials,  at  least  four 
possibilities  for  defining  membership  in  this  assembly  using  spikes  come  to 
mind:  high-frequency  activity  (rate  code),  oscillations  in  the  40  Hz  range,
bursting,  and  synchronized  firing  activity.  We  do  not  discuss  here  the  latter
idea,  that  all  cortical  neurons  whose  detailed  spike  patterns  are  temporally 
synchronized  to  each  other,  firing  action  potentials  at  about  the  same  time, 
constitute  the  neuronal  assembly  coding  for  awareness,  but  refer  the  interested
reader  instead  to  chapter  10  by  Singer  on  this  topic  (see  also  Abeles
1991).  Singer  also  discusses  the  relationship  between  oscillations  in  the  γ
range  and  synchronization.

It  is  possible,  of  course,  that  the  brain  uses  much  more  complex  spa- 
tiotemporal  neuronal  activity  patterns  than  discussed  here  to  encode  as¬ 
pects  of  awareness.  The  principal  component  analysis  of  spike  trains  in  the 
awake  monkey  (Richmond  and  Optican  1990, 1992)  suggests  such  a  possi¬ 
ble  coding.  We  feel  that  at  this  preliminary  stage  in  our  investigations  it  is 
best  to  first  investigate  the  more  obvious  possibilities. 

When  discussing  these  different  scenarios,  it  is  important  to  keep  in  mind 
that  at  the  psychological  level,  awareness  of  an  event  or  object  appears  to 
involve  attending  to  this  object  and  placing  it  into  short-term  memory.  We 
must  therefore  ultimately  find  a  link  between  the  different  forms  of  con¬ 
stituting  a  neuronal  assembly  and  short-term  or  working  memory  (Crick 
and  Koch  1990a). 

Rate  Coding 

The  simplest  encoding  uses  mean  firing  frequency.  Visual  awareness  of 
objects  is  correlated  with  all  neurons,  say  in  inferior  temporal  (IT)  cortex, 
that  fire  above  a  certain  threshold,  irrespective  of  the  temporal  charac¬ 
ter  of  the  spike  train.  This  assumes  that  all  IT  neurons  corresponding  to 
nonattended  features  are  suppressed,  as  suggested  by  the  attentional  ex¬ 
periment  of  Moran  and  Desimone  (1985)  in  monkey  IT.  In  other  words, 
at  any  given  time  only  those  neurons  whose  features  correspond  to  the 
current  content  of  awareness  are  highly  active,  while  the  firing  of  the  vast 
majority  of  neurons  is  not  significantly  elevated  above  their  background 
rate.  Here  the  binding  problem  is  solved  trivially  by  virtue  of  the  fact  that 
only  those  neurons  corresponding  to  the  attended  event  or  object  are  active 
and  the  problem  of  incorrect  binding  does  not  arise. 
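
The rate-coding scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the neuron names, spike times, analysis window, and the 40 spikes/s threshold are all hypothetical and are not taken from the experiments cited. The point is that membership in the assembly depends only on the mean rate, not on the temporal structure of the spike train.

```python
def mean_rates(spike_times, t_start, t_end):
    """Mean firing rate (spikes/s) of each neuron within [t_start, t_end)."""
    duration = t_end - t_start
    return {
        name: sum(t_start <= t < t_end for t in times) / duration
        for name, times in spike_times.items()
    }


def rate_coded_assembly(spike_times, t_start, t_end, threshold_hz):
    """Neurons whose mean rate exceeds threshold_hz; spike timing is ignored."""
    rates = mean_rates(spike_times, t_start, t_end)
    return {name for name, rate in rates.items() if rate > threshold_hz}


# Hypothetical spike times (in seconds) for three illustrative neurons.
spikes = {
    "attended_a": [0.01 * k for k in range(20)],    # 20 spikes in 0.2 s
    "attended_b": [0.0125 * k for k in range(16)],  # 16 spikes in 0.2 s
    "unattended": [0.05, 0.15],                     # 2 spikes in 0.2 s
}

# Threshold value is an assumption chosen only for this example.
assembly = rate_coded_assembly(spikes, 0.0, 0.2, threshold_hz=40.0)
```

Because only neurons for the attended object pass the threshold, no further label is needed to say which active neurons belong together, which is the sense in which binding is "solved trivially" here.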

The  beautiful  experiments  carried  out  by  Newsome  and  his  colleagues 
(Newsome  et  al.  1989;  Britten  et  al.  1992)  support  the  notion  that  the 
firing  rate  of  MT  cells  directly  encodes  motion  perception.  Analyzing  the
total  number  of  spikes  discharged  by  a  cell  during  the  2  sec  long  experiment
allows  an  "ideal"  observer  to  mimic  the  observed  psychophysical
performance  of  the  animal  to  a  remarkable  extent.  In  these  experiments, 
the  direction  and  speed  of  the  random  dot  motion  stimulus  used  were 
optimized  for  each  cell  recorded  from. 
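
One common way to formalize an ideal observer of spike counts is a receiver operating characteristic (ROC) analysis; the sketch below assumes that approach and does not claim to reproduce the procedure of the cited studies in detail. The spike counts are hypothetical. The function returns the probability that an observer comparing one trial in the cell's preferred direction with one trial in the opposite direction would pick the preferred one on the basis of spike count alone.

```python
def roc_area(pref_counts, null_counts):
    """Area under the ROC curve for two lists of per-trial spike counts.

    Equals the probability that a randomly drawn preferred-direction count
    exceeds a randomly drawn null-direction count, with ties counted as 1/2.
    """
    wins = 0.0
    for p in pref_counts:
        for n in null_counts:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pref_counts) * len(null_counts))


# Hypothetical spike counts over a 2 s trial, five trials per direction.
preferred = [42, 38, 51, 45, 40]
null = [30, 41, 28, 35, 33]

p_correct = roc_area(preferred, null)  # 23 of 25 pairings favor "preferred"
```

A value of 0.5 would mean the counts carry no information about direction, and a value of 1.0 would mean the two distributions do not overlap. Repeating the calculation at several stimulus strengths gives a curve that can be compared with the animal's behavioral performance.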

As  shown  by  the  Logothetis  and  Schall  (1989)  experiment,  only  a  mi¬ 
nority  of  neurons  in  MT  follow  the  changing  percept  associated  with  the 
binocular  rivalrous  stimulus.  Thus,  it  may  well  be  possible  that  awareness 
only  correlates  with  high-frequency  activity  in  parts  of  cortex  further  removed
from  the  sensory  periphery  such  as  V4,  IT,  or  prefrontal  cortex  due
to  relevant  differences  in  circuitry  or  biophysics  (only  enabling  neurons  in 
these  areas,  for  instance,  to  store  short-term  memory).  In  our  opinion  it  is 
unlikely  that  cells  behave  in  such  an  all-or-none  manner  in  higher  cortical
areas,  that  is,  firing  only  if  they  belong  to  the  neuronal  assembly  encoding
awareness. 

The  major  computational  disadvantage  of  rate  encoding  for  solving  the 
binding  problem  is  the  associated  loss  in  bandwidth.  Using  a  sort  of  tem¬ 
poral  code  allows  the  superposition  of  information  coded  via  the  firing  fre¬ 
quency  (i.e.,  orientation,  speed,  hue,  etc.),  and  information  coded  in  time, 
here  that  the  neurons  are  part  of  a  particular  assembly  (von  der  Malsburg 
1981;  von  der  Malsburg  and  Schneider  1986). 

Neuronal  Oscillations 

In  our  original  publication  on  this  topic  (Crick  and  Koch  1990a),  we  hy¬ 
pothesized  that  all  neurons  corresponding  to  various  aspects  of  the  object 
the  observer  is  currently  directing  attention  to  fire  in  an  oscillatory  and 
semisynchronous  manner,  binding  them  together.  Chapter  10  by  Singer 
discusses  the  background  and  current  status  of  these  oscillations  (see  also 
Koch  1993).  We  here  briefly  highlight  their  status  with  respect  to  cat  and 
monkey  cortex. 

Oscillations  in  the  Cat.  Semisynchronous  oscillations  in  local  field  potentials
as  well  as  multi-  and  single-unit  activity  in  the  30-70  Hz  range  were
reported  in  the  first  visual  (V1)  area  of  the  lightly  anesthetized  cat  (Eckhorn
et  al.  1988;  Gray  and  Singer  1989;  Gray  et  al.  1989).  These  oscillations  were 
subsequently  shown  to  be  present  in  awake  kittens  and  more  recently  in 
alert  cats. 

The  40-Hz  oscillations  are  most  clearly  seen  in  the  local  field  potentials, 
rather  than  in  individual  neurons,  although  recently  Jagadeesh  et  al.  (1992),
using  in  vivo  patch-clamping,  reported  that  the  intracellular  membrane  potential
of  cells  in  cat  visual  cortex  oscillates  in  the  40  Hz  range.  About  10%
of  simple  cells,  but  more  than  half  of  all  recorded  complex  cells  show  oscil¬ 
lations  with  a  mean  value  around  50  Hz.  The  oscillations  may  last  for  rela¬ 
tively  short  periods  (such  as  200  msec)  and  vary  in  frequency  between  trials, 
so  that  averaging  over  longer  periods  may  make  them  almost  invisible.  The 
oscillations  themselves  are  not  locked  to  the  stimulus.  The  site  of  origin  of 
these  oscillations  is  still  controversial.  Two  intracellular  studies  have  re¬ 
vealed  depolarization-dependent  subthreshold  oscillations  in  cortical  cells 
in  the  40  Hz  range  (Llinas  et  al.  1991;  Nunez  et  al.  1992).  Yet,  strong  50-Hz 
oscillations  have  been  seen  in  cat  geniculate  cells  and  may  originate  already 
in  the  retina  (Ghose  and  Freeman  1992);  furthermore,  subthreshold  oscilla¬ 
tions  in  the  20-40  Hz  band  can  be  evoked  by  intracellular  current  injections 
in  cat  thalamocortical  relay  cells  (Steriade  et  al.  1991).  It  appears  that  40-Hz 
oscillations  can  be  generated  at  a  number  of  sites  in  the  nervous  system. 
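
The remark that oscillations not locked to the stimulus become almost invisible after averaging can be illustrated with a small simulation. This is a sketch under simplifying assumptions: each trial is a pure 40 Hz sinusoid lasting 200 msec, only the phase varies randomly from trial to trial (the frequency jitter mentioned above is left out), and the sampling rate and number of trials are arbitrary choices.

```python
import math
import random


def power_at(signal, freq_hz, dt):
    """Normalized power of `signal` at freq_hz (single-frequency Fourier sum)."""
    re = sum(x * math.cos(2 * math.pi * freq_hz * i * dt) for i, x in enumerate(signal))
    im = sum(x * math.sin(2 * math.pi * freq_hz * i * dt) for i, x in enumerate(signal))
    return (re * re + im * im) / len(signal) ** 2


rng = random.Random(0)
dt, n_samples, n_trials, freq = 0.001, 200, 100, 40.0  # 200 msec trials, 1 kHz

trials = []
for _ in range(n_trials):
    phase = rng.uniform(0.0, 2 * math.pi)  # oscillation is not stimulus-locked
    trials.append(
        [math.sin(2 * math.pi * freq * i * dt + phase) for i in range(n_samples)]
    )

# Analyze each trial separately, then average the resulting power values.
mean_of_powers = sum(power_at(tr, freq, dt) for tr in trials) / n_trials

# Average the raw traces first, then analyze the averaged trace.
average_trace = [sum(tr[i] for tr in trials) / n_trials for i in range(n_samples)]
power_of_mean = power_at(average_trace, freq, dt)
```

Every single trial carries the full 40 Hz power, but the randomly phased traces largely cancel in the average, so `power_of_mean` comes out far smaller than `mean_of_powers`. This is why trial-by-trial analysis is needed to detect such oscillations.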
When  recording  from  two  distinct  sites  in  cortex,  the  oscillations  can
be  phase-locked  with  a  phase-shift  of  ±3  msec  around  the  origin  (Engel 
et  al.  1990;  see  also  chapter  10  by  Singer).  When  the  distance  between 
the  two  sites  is  small,  the  cross-correlation  depends  little  on  the  preferred 
orientation  of  the  units  being  recorded  from.  However,  at  distances  up 
to  10  mm  and  across  the  vertical  midline  (i.e.,  when  recording  from  the 
two  cortical  hemispheres;  Engel  et  al.  1991b)  the  cross-correlation  shows 
a  significant  peak  only  if  the  two  stimuli  are  spatially  aligned  with  each 
other.  It  can  be  shown  that  stimulation  of  the  two  sites  by  an  elongated 
single  bar  leads  to  strong  synchronization  among  the  firing  activities  at  the 
two  sites,  while  stimulation  with  two  separate,  but  still  aligned,  bars  leads 
to  a  reduced  cross-correlogram  (Gray  et  al.  1990).  This  has  yet  to  be  shown 
for  the  alert  animal. 
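
For readers unfamiliar with the cross-correlogram, a minimal sketch of how one is computed from two spike trains follows. The spike trains are hypothetical: both fire regularly at 40 Hz and the second lags the first by 2 msec. The function name, bin width, and lag range are illustrative choices, not those of the cited studies.

```python
def cross_correlogram(train_a, train_b, max_lag_ms, bin_ms):
    """Histogram of spike-time differences (b - a) within +/- max_lag_ms."""
    n_bins = int(round(2 * max_lag_ms / bin_ms))
    counts = [0] * n_bins
    for ta in train_a:
        for tb in train_b:
            lag = tb - ta
            if -max_lag_ms <= lag < max_lag_ms:
                counts[int((lag + max_lag_ms) // bin_ms)] += 1
    return counts


# Hypothetical spike times in msec: two sites firing at 40 Hz, site B 2 msec late.
site_a = list(range(0, 500, 25))
site_b = [t + 2 for t in site_a]

max_lag_ms, bin_ms = 30, 1
counts = cross_correlogram(site_a, site_b, max_lag_ms, bin_ms)

# Lag (in msec) of the tallest bin, measured at the left edge of the bin.
peak_lag_ms = counts.index(max(counts)) * bin_ms - max_lag_ms
```

A central peak close to zero lag indicates that the two sites tend to fire together, and the side peaks one oscillation period (25 msec) away from it reflect the shared rhythm. Unrelated spike trains would instead give a roughly flat histogram.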

The  fact  that  the  40-Hz  oscillations  are  seen  in  lightly  anesthetized  cats  is 
not  by  itself  fatal  to  our  theory,  as  the  anesthetic  may  remove  or  reduce  some 
other  essential  aspect  of  visual  awareness,  such  as  short-term  memory. 

Oscillations  in  the  Monkey.  There  has,  so  far,  been  much  less  work  on  the
oscillations  in  monkey  visual  cortex.  Livingstone  (1991)  has  seen  strong
oscillations  in  both  single-unit  activity  and  local  field  potentials  in  V1  in  the
anesthetized  monkey  (see  also  Freeman  and  van  Dijk  1987).  Recording 
from  two  sites  with  two  electrodes,  she  often  observes  phase-locked  oscil¬ 
lations  using  either  a  single  bar  or  two  bars.  Kreiter  and  Singer  (1992)  report 
brief  periods  of  highly  oscillatory  activity  in  multiunit  data  in  the  awake 
and  fixating  monkey  in  area  MT.  Based  on  their  criteria,  58%  of  units  show 
significant  oscillatory  activity  in  the  30-60  Hz  range.  Oscillatory  response 
episodes  are  often  of  short  duration  (<  300  msec),  do  not  occur  on  each  trial, 
and  can  vary  in  their  oscillation  frequency  both  within  and  between  trials 
(Kreiter  and  Singer  1992).  Thus,  averaging  cell  responses  over  long  times
will  render  the  oscillations  invisible.  Nakamura  et  al.  (1991,  1992)  record 
single  neurons  in  the  temporal  lobe  of  monkey  during  a  short-term  mem¬ 
ory  task.  About  one  quarter  of  their  neurons  show  a  stimulus-dependent, 
sustained  firing  following  the  short  display  of  a  particular  figure.  The  asso¬ 
ciated  autocorrelograms  showed  pronounced  oscillations  whose  frequency 
varied  from  one  stimulus  to  the  next  between  3  and  28  Hz.  However,  the 
majority  of  oscillations  had  frequencies  less  than  8  Hz. 

A  thorough  search  for  oscillatory  neuronal  responses  in  the  monkey  was 
carried  out  by  Young  et  al.  (1992).  The  autocorrelation  function  associated 
with  multiunit  activity  showed  oscillations  in  the  12-13  Hz  (α  range)  at
about  10%  of  recording  sites  in  areas  V1  and  MT  and  no  oscillations  in  inferior
temporal  cortex  of  the  anesthetized  monkey.  In  the  behaving  monkey
(performing  a  face  discrimination  task),  only  2  of  50  recording  sites  in  IT 
showed  oscillations  in  the  40-50  Hz  range  (Young  et  al.  1992;  see  also 
Tovee  and  Rolls  1992).  Bair  et  al.  (1994)  analyzed  the  response  of  212  cells 
in  extrastriate  cortex  (area  MT)  in  the  macaque  monkey  while  the  monkey 
performed  a  very  demanding  motion  discrimination  task  (Newsome  et
al.  1989).  Applying  the  same  criteria  as  Kreiter  and  Singer  (1992)  in  their
study  of  multiunit  activity  in  MT,  Bair  et  al.  (1993)  find  only  a  single  cell
that  shows  strong  oscillatory  activity.  This  result,  obtained  using  random- 
dot  stimuli,  is  quite  distinct  from  the  high  percentage  of  oscillatory  cells 
reported  to  be  found  in  the  same  cortical  area  when  using  high-contrast 
bar  stimuli  (Kreiter  and  Singer  1992). 

Two  groups  find  oscillations  in  somatosensory  and  motor  regions  of  the 
alert  monkey.  Murthy  and  Fetz  (1992)  report  the  existence  of  large,  25- 
35  Hz,  oscillations  in  both  local  field  potentials  and  single-unit  activity  in 
pre-  and  postcentral  cortex  of  two  awake  rhesus  monkeys.  These  oscilla¬ 
tions  are  synchronized  across  up  to  20  mm.  Sanes  and  Donoghue  (1993) 
record  local  field  potentials  from  up  to  12  sites  in  motor  and  premotor  cor¬ 
tical  areas  while  the  monkey  is  waiting  for  a  visual  cue  to  carry  out  a  hand 
movement.  These  potentials  show  oscillations  in  the  20-50  Hz  range  that 
are  phase-locked  across  different  sites.  Once  the  visual  cue  appears  and  the 
movement  is  executed,  both  the  oscillations  and  the  synchronization
of  neuronal  activity  abruptly  cease.

Our  original  hypothesis  (Crick  and  Koch  1990a)  was  that  the  phase- 
locked  firing  of  a  set  of  neurons  at  40  Hz  was  the  neural  correlate  of  visual 
awareness.  Such  a  set  would  correspond  to  the  semiglobal  activity  referred 
to  earlier.  So  far  the  experimental  evidence  has  lent  rather  little  support  to 
this  hypothesis,  though  it  may  still  be  true  that  the  40  Hz  oscillations  are 
used  as  part  of  the  processes  leading  up  to  visual  awareness,  such  as  figure- 
ground  segregation,  as  first  suggested  by  Milner  (1974)  and  discussed  at 
length  by  von  der  Malsburg  (1981;  von  der  Malsburg  and  Schneider  1986). 
More  experimental  work  on  the  natural  history  of  the  40-Hz  oscillations, 
where  and  when  they  can  be  evoked,  etc.,  is  urgently  required,  especially 
in  the  alert  macaque  monkey. 

Bursting  and  the  Lower-Layers  Hypothesis 

Although  it  is  possible  that  all  cortical  neurons  can,  at  one  time  or  another, 
be  part  of  the  neural  correlate  of  awareness,  it  is  sensible  to  explore  the  idea 
that  only  a  limited  subset  of  cortical  neurons  has  this  property.  Our  lower- 
layer  hypothesis  states  that  the  neural  correlates  of  visual  awareness  occur 
mainly  in  the  lower  layers  5  and  6  of  the  cortex.  The  input  layer  as  well  as 
neurons  in  the  upper  layers  2  and  3  are  assumed  to  be  mainly  concerned 
with  unconscious  processing.  We  were  led  to  this  hypothesis  for  several 
reasons. 

What  Becomes  Conscious.  Cognitive  scientists,  such  as  Johnson-Laird
(1988),  have  suggested  that  the  content  of  consciousness  consists  of  the 
results  of  neural  computation  while  the  interim  results  associated  with  the 
computations  leading  up  to  these  results  are  themselves  largely  unconscious.
The  only  cortical  neurons  that  project  right  out  of  the
cortical  system  (that  is,  neither  to  other  cortical  areas,  nor  to  the  thalamus
nor  the  claustrum)  are  in  layer  5.  For  instance,  layer  5  pyramidal  cells  in
the  early  visual  cortices,  including  V1,  V2,  V3,  and  MT  (Ungerleider  et  al.
1983)  project  to  the  superficial  layers  of  the  superior  colliculus  and  to  the 
pontine  nuclei.  The  corticospinal  pyramidal  tract,  with  one  million  axons
the  largest  descending  fiber  tract  from  the  human  brain,  originates  in  layer
5  of  primary  motor,  supplementary  motor,  and  premotor  cortical  areas  and
projects  onto  interneurons  and  motoneurons  in  the  spinal  cord.  Similarly,
the  massive  projection  system  linking  virtually  the  entire  neocortex  with 
the  striatum  in  the  monkey  originates  in  layer  5  (Jones  et  al.  1977).  It  could 
be argued that what the cortex sends elsewhere in the brain is likely to
be the results of its computations.

Sleep  and  the  Lower  Layers  A  second  reason  that  attracted  our  attention 
to the lower layers comes from the study by Livingstone and Hubel (1981)
of  the  same  neuron  in  cat  striate  cortex,  both  when  the  animal  was  awake 
and  when  it  was  in  slow-wave  sleep.  Their  main  result  was  that  the  general 
character  of  the  neuronal  responses  was  similar  in  these  two  states,  though 
the  signal-to-noise  ratio  was  improved  in  the  awake  animal.  In  addition, 
some neurons fired more strongly when the animal was awake. Such neurons
were found predominantly (but not exclusively) in the lower cortical
layers. They confirmed this result by using double-labeled deoxyglucose
studies. These showed that the average neuronal activity was greater in
the lower cortical layers of V1 when the animal was awake.

We  would  like  to  note  in  this  context  that  Livingstone  (1991)  reports 
high-frequency  oscillations  only  in  the  superficial  layers  and  in  layer  4,  but 
not  in  layers  5  and  6  of  primary  visual  cortex  in  the  anesthetized  monkey.  It 
would  be  quite  intriguing  to  know  to  what  extent  oscillations  can  be  found 
in  the  deep  layers  of  V2. 

Bursting  Cells  There  exists  a  subclass  of  pyramidal  cells  in  the  lower 
layers  that  behaves  differently  from  other  pyramidal  or  spiny  stellate  cells. 
Intracellular current injections into cells in rodent slices of sensorimotor
cortex have revealed two types of pyramidal cells (McCormick et al. 1985;
Connors  and  Gutnick  1990;  Agmon  and  Connors  1992).  The  majority  of 
these  in  vitro  cells  respond  to  a  sustained,  intracellular  current  by  a  train 
of action potentials, which adapt within 50-100 msec to a more moderate
discharge rate ("regular spiking" cells; figure 5.1A) and correspond to
pyramidal cells throughout all layers, including layer 5 (figure 5.1B). A
second  class  of  neurons  responds  to  the  depolarization  by  generating  a 
short  burst  of  2-4  spikes,  followed  by  a  long  hyperpolarization.  This  cycle 
of  burst  and  hyperpolarization  persists  for  as  long  as  the  current  stimulus 
persists ("intrinsically bursting" cells; figure 5.1C). The latter cells appear
to  be  confined  (at  least  in  rat  and  guinea  pig  slice)  to  layer  5  (Agmon  and 
Connors  1992).  Bursting  cells  are  quite  large  and  have  apical  dendrites 
that  extend  up  to  layer  1  and  arborize  there  (Chagnac-Amitai  et  al.  1990; 
Larkman and Mason 1990; figure 5.1D). In rat and cat visual cortex, the

Figure 5.1 "Regularly spiking" and "intrinsically bursting" pyramidal cells from rodent
somatosensory slice. (A) The majority of pyramidal cells in all layers respond to a rectangular
current pulse injected at the soma by a train of adapting action potentials. They are found
throughout layers 2-6. (B) The morphology of such a "regular spiking" biocytin-filled layer
5 pyramidal cell. Notice the sparse dendritic tuft in layer 1. Frequently the apical tree of
"regularly spiking" cells does not extend past layers 2/3. (C) Burst cells, limited to layer 5,
respond to the same intracellularly delivered current step with a typical pattern of bursts.
(D) Their morphology reveals extensive dendritic branching in layer 1. Layer 5 bursty cells
project to subcortical targets. (Drawings are modified from Agmon and Connors 1992 and
B. W. Connors, personal communication)


axons  of  these  cells  project  subcortically,  here  to  the  ipsilateral  superior 
colliculus,  while  pyramidal  cells  with  short  dendrites  not  reaching  into 
layer  1  project  into  other  parts  of  cortex  (Hubener  et  al.  1990;  Kasper  et  al. 
1991).  In  guinea  pig  sensorimotor  and  visual  cortex  maintained  in  vitro, 
layer  5  bursting  cells  project  to  the  superior  colliculus  and  to  the  pontine 
nuclei  (Wang  and  McCormick  1993). 

Layer 5 burst-generating neurons typically exhibit rhythmic burst firing
in the frequency range of 0.2-10 Hz, depending on the level of somatic
depolarization (Wang and McCormick 1992), while Silva et al. (1991)
demonstrated that layer 5 slice neurons can generate brief periods of oscillatory
field potentials in the 5-10 Hz band in response to activation of excitatory
afferents.

This classification of pyramidal cells into "regular spiking" and "intrinsically
bursting" cells has not been carried out in primate cortex. However,
extracellularly recorded cells in the awake monkey frequently show
a  burst-like  pattern  (figure  5.2).  In  their  statistical  analysis  of  firing  prop¬ 
erties  of  neurons  in  area  MT  in  the  behaving  monkey,  Bair  and  colleagues 
(1994)  found  that  two  thirds  of  the  recorded  cells  frequently  fire  in  bursts  of 
2-4  spikes  within  2-6  msec  and  show  a  small  peak  in  the  25-50  Hz  band  in 
the associated power spectrum. The amplitude of the spectral peak in
this frequency band relates directly to their propensity to fire bursts. The
statistical  properties  of  this  cell  class  can  be  fitted  by  the  assumption  of 
Poisson-distributed  bursts  with  a  burst-dependent  refractory  period.  The 
remaining  third  of  their  cells  have  an  autocorrelation  function  and  an  in¬ 
terspike  interval  distribution  compatible  with  the  notion  that  spikes  are 
Poisson  distributed  with  a  refractory  period  (figure  5.2).  Cells  are  either  of 
the  bursting  or  of  the  nonbursting  type  and  do  not  change  from  one  type 
to  the  other.  It  is  not  known  whether  these  bursting  cells  correspond  to  the 
intracellularly  defined  "intrinsic  bursters." 
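
The two statistical descriptions above can be illustrated with a small simulation. The following Python sketch is not the analysis of Bair and colleagues; it only generates one spike train from a Poisson process with a refractory period and one from Poisson-distributed bursts with a burst-dependent refractory period, and compares the fraction of short interspike intervals. All function names and parameter values (rates, refractory periods, 2-4 spikes per burst spaced 1-2 msec apart, the 2.5-msec cutoff) are assumptions chosen for illustration.

```python
import random


def poisson_refractory_train(rate_hz, refractory_ms, duration_ms, rng):
    """Spike times (msec) of a Poisson process with an absolute refractory period."""
    mean_isi_ms = 1000.0 / rate_hz
    # Mean of the exponential waiting time that follows each refractory period,
    # chosen so that the overall mean interval is 1000 / rate_hz.
    free_mean_ms = mean_isi_ms - refractory_ms
    if free_mean_ms <= 0:
        raise ValueError("refractory period too long for the requested rate")
    t, spikes = 0.0, []
    while True:
        t += refractory_ms + rng.expovariate(1.0 / free_mean_ms)
        if t >= duration_ms:
            return spikes
        spikes.append(t)


def poisson_burst_train(burst_rate_hz, refractory_ms, duration_ms, rng):
    """Bursts start at Poisson times (with a refractory period after each onset).

    Each burst holds 2-4 spikes separated by 1-2 msec (assumed values), so a
    burst lasts at most 6 msec; refractory_ms should exceed that.
    """
    spikes = []
    for onset in poisson_refractory_train(burst_rate_hz, refractory_ms,
                                          duration_ms, rng):
        t = onset
        spikes.append(t)
        for _ in range(rng.randint(2, 4) - 1):
            t += rng.uniform(1.0, 2.0)
            spikes.append(t)
    return spikes


def short_isi_fraction(spikes, threshold_ms=2.5):
    """Fraction of interspike intervals shorter than threshold_ms."""
    isis = [b - a for a, b in zip(spikes, spikes[1:])]
    if not isis:
        return 0.0
    return sum(isi < threshold_ms for isi in isis) / len(isis)


if __name__ == "__main__":
    rng = random.Random(0)
    regular = poisson_refractory_train(50.0, 3.0, 60000.0, rng)
    bursty = poisson_burst_train(17.0, 10.0, 60000.0, rng)
    print("nonbursty short-ISI fraction:", short_isi_fraction(regular))
    print("bursty short-ISI fraction:", short_isi_fraction(bursty))
```

With these assumed parameters both trains fire at roughly 50 spikes/sec, yet only the bursty train has a large share of its intervals in the shortest bins, which is the qualitative signature described in the text.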

In  their  study  of  the  relationship  between  single  unit  firing  properties 
and  the  behavior  of  the  animal,  Newsome  and  colleagues  (Newsome  et 
al.  1989;  Britten  et  al.  1992)  recorded  from  MT  cells  while  the  monkey 
performed  a  near-threshold  direction-of-motion  discrimination  task.  Us¬ 
ing  signal-detection  theory  (ROC  analysis)  based  on  the  total  number  of 
spikes occurring during the trial (here 2 sec), they obtained a neuronal threshold
and compared it against the more conventional psychometric threshold
of the animal, finding that the two are very similar. Bair et al. (1994) show
that the sensitivity of the ROC analysis can be improved, in a few cases
by a factor of two, if a "burst" of spikes is treated as a single event, rather
than as consisting of a variable number of single action potentials. This
argues  for  the  idea  that  "bursts"  are  events  that  are  treated  differently  by 
the  nervous  system  than  isolated  spikes  (see  also  Bonds  1992). 
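
The idea of treating a burst as a single event can be made concrete with a short sketch. The code below is an illustration under stated assumptions, not the procedure used by Bair et al. (1994): the 3-msec criterion for grouping spikes into one event and both function names are hypothetical. The ROC area is computed as the probability that a response under one stimulus condition exceeds a response under the other, with ties counted as one half.

```python
def count_events(spike_times_ms, burst_isi_ms=3.0):
    """Count events in one trial, merging spikes into bursts.

    A spike that follows the previous spike by less than burst_isi_ms
    (an assumed criterion) is treated as part of the same event.
    Setting burst_isi_ms to 0 gives the plain spike count.
    """
    events = 0
    previous = None
    for t in sorted(spike_times_ms):
        if previous is None or t - previous >= burst_isi_ms:
            events += 1
        previous = t
    return events


def roc_area(counts_a, counts_b):
    """Area under the ROC curve for two sets of per-trial counts.

    Equals P(a > b) + 0.5 * P(a == b) over all pairs of trials.
    """
    wins = 0.0
    for a in counts_a:
        for b in counts_b:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(counts_a) * len(counts_b))


if __name__ == "__main__":
    # Hypothetical spike times (msec) for two trials per condition.
    condition_a = [[0.0, 1.5, 3.0, 50.0, 100.0, 101.0], [10.0, 40.0, 41.0, 90.0]]
    condition_b = [[5.0, 6.0, 7.0, 8.0], [20.0, 21.5]]
    for criterion in (0.0, 3.0):
        a = [count_events(trial, criterion) for trial in condition_a]
        b = [count_events(trial, criterion) for trial in condition_b]
        print(criterion, a, b, roc_area(a, b))
```

Comparing the ROC area obtained with spike counts against the area obtained with event counts is the kind of comparison the text describes; whether the event-based measure is more sensitive depends on the data.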

Bursting  and  Short-Term  Memory  Why  should  the  nervous  system  have 
two  types  of  neurons,  one  signaling  isolated  spikes  and  the  other  predom¬ 
inantly  responding  in  bursts  of  spikes?  Could  these  two  cell  types  convey 
fundamentally different types of information (Crick 1984)? One biophysically
plausible argument is that bursting neurons are much more efficient at
accumulating calcium in their axonal terminals than cells that fire isolated
spikes  (that  is,  four  spikes  within  a  10-msec  interval  cause  a  much  larger 
increase  in  intracellular  calcium  at  the  end  of  the  last  spike  than  four  spikes 
arriving  within  a  40-msec  interval).  Because  intracellular  calcium  accumu¬ 
lation  in  the  presynaptic  terminal  is  thought  to  be  mainly  responsible  for 
various  forms  of  short-term  potentiation  (in  particular  facilitation  and  aug¬ 
mentation;  Magleby  1987),  it  may  well  be  that  the  primary  function  of  layer 
5  bursting  neurons  is  to  induce  this  non-Hebbian  (that  is,  nonassociative) 


104 


Koch  and  Crick 


[Figure 5.2 plot residue: axes labeled Time (msec) and Interspike Interval (msec)]


Figure  5.2  Spike  train  statistics  from  two  extracellularly  recorded  cells  in  cortical  area  MT 
in  the  behaving  monkey  responding  to  randomly  moving  dots.  The  top  row  illustrates  a 
typical  500-msec  segment  of  spike  occurrence  times.  The  distribution  of  intervals  between 
two  consecutive  spikes  (ISI)  is  shown  in  the  middle  row  (note  the  different  scales).  The  power 
spectra  of  the  two  cells  are  shown  in  the  bottom  row.  Both  cells  fire  at  roughly  similar  average 
rates  (40-60  spikes /sec).  One-third  of  all  cells  are  of  the  nonbursty  type,  with  an  ISI  and  a 
power  spectrum  expected  from  a  Poisson  process  with  a  refractory  period.  Statistics  from 
one  such  cell  are  shown  in  the  right  column  (averaged  over  30  trials).  About  two-thirds  of 
cells  are  of  the  bursty  type,  with  a  significant  fraction — if  not  the  majority — of  the  interspike 
intervals  falling  into  the  first  three  bins  of  the  ISI  and  a  peak  in  the  power  spectrum  in  the 
20-60  Hz  range.  Statistics  from  such  a  bursty  cell,  based  on  15  trials,  are  shown  in  the  left 
column.  The  relationship — if  any — between  these  bursting  and  the  "intrinsically  bursting" 
pyramidal cells from slices (see figure 5.1) is not known. (Drawings are modified from Bair
et al. 1993)


type  of  synaptic  plasticity  at  its  postsynaptic  targets  outside  of  the  cortical 
system.  Recurrent  spiking  activity  within  one  or  several  seconds  at  these 
synapses  will  then  cause  a  greater  postsynaptic  signal  than  without  the 
"priming"  by  the  previous  burst.  In  essence,  the  burst  of  spikes  acts  to 
turn  on  short-term  memory,  which  then  decays  over  several  seconds  (an 
interesting  theoretical  question  is  whether  or  not  very  short-term  memory 
needs  to  be  associative). 
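
The calcium argument can be sketched with a toy model. The code below assumes, purely for illustration, that each spike adds one unit of calcium to the presynaptic terminal and that calcium then decays exponentially with a time constant of 20 msec; neither the time constant nor the linear summation comes from the text, and the function names are hypothetical. It compares the calcium level right after the last of four spikes delivered within 10 msec and within 40 msec.

```python
import math


def calcium_after_last_spike(spike_times_ms, tau_ms=20.0, increment=1.0):
    """Residual calcium (arbitrary units) immediately after the last spike.

    Each spike contributes `increment`, which decays as exp(-dt / tau_ms)
    during the time dt that remains until the last spike.
    """
    last = max(spike_times_ms)
    return sum(increment * math.exp(-(last - t) / tau_ms)
               for t in spike_times_ms)


def evenly_spaced(n_spikes, window_ms):
    """n_spikes spike times spread evenly from 0 to window_ms."""
    step = window_ms / (n_spikes - 1)
    return [i * step for i in range(n_spikes)]


if __name__ == "__main__":
    burst = calcium_after_last_spike(evenly_spaced(4, 10.0))
    spread = calcium_after_last_spike(evenly_spaced(4, 40.0))
    print("four spikes in 10 msec:", round(burst, 2))
    print("four spikes in 40 msec:", round(spread, 2))
```

Under these assumptions the burst leaves more residual calcium than the same number of spikes spread over the longer interval. The size of the difference depends strongly on the assumed decay time constant, and any nonlinear dependence of potentiation on calcium would change it further.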

Thus  a  very  simplistic  answer  to  the  question  "Which  neurons  fire  in 
such  a  way  that  they  correlate  with  awareness?"  would  be  "The  large 
pyramidal  cells  in  layer  5  that  fire  in  bursts  and  project  outside  the  cortical 
system!"  It  would  be  marvelous  if  this  were  true  but  the  answer  is  unlikely 
to be as simple as that.

The Lower Layers and the Thalamus Pyramidal cells in both layer 5 and
layer  6  project  to  the  various  thalamic  nuclei,  including  the  lateral  genic¬ 
ulate  nucleus  (LGN),  as  well  as  the  inferior,  lateral,  and  medial  pulvinar 
nuclei. In monkey, only layer 6 of area V1 projects back to the LGN, while
higher  cortical  areas  project  to — and  receive  from — the  different  pulvinar 
nuclei. In primate area V1, cells in layer 5 as well as the deep part of layer
6  project  to  the  pulvinar  (Conley  and  Raczkowski  1990),  while  higher  cor¬ 
tical  areas  project  from  the  deep  layers  into  the  different  pulvinar  nuclei 
(with  the  general  rule  that  as  one  goes  from  occipital  lobe  to  more  anterior 
cortical  areas,  the  thalamic  target  areas  move  from  inferior  to  lateral  to  me¬ 
dial  pulvinar).  The  precise  layer  of  origin  of  this  corticothalamic  projection 
is not known.

In  cat,  about  half  of  all  pyramidal  cells  in  layer  6  project  back  to  the  LGN 
while  others  project  to  the  claustrum.  This  corticogeniculate  projection 
is  so  massive  that  at  least  10  times  more  fibers  project  down  than  project 
from the LGN in V1 (Sherman and Koch 1986). It is known in cat that
the  propagation  delay  of  these  cells  is  unusually  long  and  heterogeneous, 
ranging  from  2  to  20  or  more  msec  (Tsumoto  et  al.  1978),  in  agreement  with 
their  unmyelinated  nature.  These  could  conceivably  form  reverberating 
circuits  that  hold  activity  in  very  short-term  memory.  Furthermore,  the 
circuit LGN → layer 6 → LGN is composed of neurons whose axons have
very few horizontal collaterals. This may prevent the "reverberation" from
spreading  too  easily  to  adjacent  neurons.  Under  an  anesthetic  such  possible 
reverberations  may  be  too  weak  to  become  established  (incidentally,  almost 
all  work  on  the  function  of  the  corticogeniculate  pathway  has  been  done 
only  on  anesthetized  animals  and  therefore  its  main  function  may  have 
been  missed;  Sherman  and  Koch  1986). 

A  final  observation  of  possible  relevance  to  the  distinction  between  lower 
and upper layers is that visually induced activity can be blocked by NMDA
antagonists  in  the  superficial,  but  not  in  the  deep  layers  of  cat  visual  cortex 
(Fox  et  al.  1989, 1990).  Iontophoretic  and  radioligand  binding  studies  from 
a  number  of  different  labs  argue  for  high  densities  of  NMDA  receptors  in 
superficial  and  low  densities  in  the  input  and  the  deep  layers  of  cortex 
(summarized  in  Tsumoto  1990). 

HYPOTHESES  ABOUT  CONNECTIONS 

So  much  for  the  lower  cortical  layers.  Are  there  any  signs  of  connections 
between  cortical  areas  that  might  relate  to  visual  awareness?  We  have  been 
able  to  think  of  three  different  clues. 

1. Uses of content of "consciousness." The first asks the question, What is
conscious  information  used  for?  In  neuronal  terms  one  might  expect  that 
the  information  is  not  only  exported  subcortically  but  is  also  made  avail¬ 
able  to  at  least  the  hippocampal  system  and  the  higher,  planning  levels  of 
the  motor  system,  probably  located  in  or  near  the  anterior  portion  of  the 
cingulate sulcus. We will not describe these connections in detail as so far
we have not been able to turn up any neuroanatomical data that might
help. For example, exactly which neurons project from the higher levels of
the visual system to those of the motor system? We will not detail this idea
further  though  we  shall  continue  to  keep  an  eye  on  it. 

2.  Visual  processing  and  backprojections.  The  second  idea  was  put  forward 
some  time  ago  in  a  general  way  by  Milner  (1974).  He  proposed  that  an 
essential feature of visual processing would turn out to be the backprojections
to V1 (or conceivably to V1 and/or V2) as these areas are the only
ones  with  detailed  information  about  precise  visual  location.  Supporting 
evidence  comes  from  a  study  of  the  somatosensory  system  that  uses  a  com¬ 
bination  of  current  source  analysis  and  somatosensory-evoked  potentials 
in  the  awake  monkey  (Cauller  and  Kulics  1991a,b).  These  authors  argue 
that the backward projections from S2 to S1 (that are targeted specifically
to  the  superficial  layers  1  and  2)  are  involved  in  the  conscious  process  of 
touch  sensation  as  measured  by  the  evoked  potentials. 

Looking  at  the  diagram  of  Felleman  and  Van  Essen  (1991)  for  the  connec¬ 
tions  between  cortical  areas  in  the  visual  system  of  the  macaque  monkey, 
it  would  appear  that  while  areas  MT  and  V4  do  send  projections  back  to 
V1, the inferotemporal regions do not. However, more recent evidence
(K.  Rockland,  personal  communication)  suggests  that  projections  between 
cortical  areas  may  frequently  be  nonreciprocal.  While  the  inferotemporal 
regions do not receive a direct projection from V1, Rockland claims that
they do send backprojections to V1. Of course there may be several kinds
of  backprojections.  Clearly  this  idea  of  Milner's  needs  to  be  kept  in  mind 
as  further  results  come  in. 

3. Reentrant projections. The third idea about connections is based on
"reentrant" or backward connections, so strongly emphasized by Edelman
(1987). The basic idea is that consciousness is, in some sense, the
brain reflecting on itself and that this needs reentrant connections, that
is, connections that form a circuit that finishes where it began.

An  obvious  case  of  massive  reentrant  connections  is  those  formed  by  the 
hippocampal  system  in  the  medial  temporal  lobe.  Its  input  comes  mainly 
from  the  entorhinal  cortex  and  its  output  returns  there,  though  to  a  differ¬ 
ent  cortical  layer.  However,  as  we  know  from  such  patients  as  H.M.  and 
Boswell,  the  complete  removal  of  this  system  by  certain  types  of  brain  dam¬ 
age  in  humans  does  not  affect  immediate  awareness,  though  the  patient 
cannot  remember  any  recent  incident  that  took  place  more  than  a  minute 
or  so  before  (Squire  and  Zola-Morgan  1991).  There  are  very  many  other 
reentrant  connections  in  the  brain,  so  it  is  not  easy  to  discover  which  ones 
might  be  intimately  involved  in  visual  awareness.  Are  there  any  that  have 
some  unusual  character? 

DIFFERENT  FORMS  OF  AWARENESS 

Before  closing,  let  us  return  to  a  topic  we  discussed  only  tangentially.  In 
our original manuscript (Crick and Koch 1990a), we postulated the existence
of two forms of awareness: a brief and very transient one and a
form  associated  with  selective,  visual  attention.  It  is  the  presence  of  the 
latter,  coupled  with  short-term  memory,  that  we  believe  mediates  vivid 
awareness.  However,  in  the  absence  of  the  former,  our  visual  environment 
would  have  the  appearance  of  a  tunnel,  in  which  the  currently  attended 
location  appears  in  vivid  detail  with  its  associated  perceptual  attributes 
while  everything  else  is  invisible  or  hazy.  We  were  therefore  led  to  postu¬ 
late  another  form  of  fleeting  awareness,  enough  to  mediate  the  perceptual 
richness  we  take  for  granted  when  looking  at  the  world. 

Because it lacks an attentional mechanism, we believe that fleeting
awareness is not associated with solving the transient type of binding problem
but encodes only perceptual features that are bound within single
neurons due to epigenetic factors or overlearning.

However,  given  the  complexities  of  the  brain,  it  may  well  be  possible 
that  many  more  different  forms  of  awareness  coexist  at  all  times,  each 
with  different  functional  abilities.  For  instance,  it  may  well  be  that  each 
corticothalamic-cortical loop instantiates its own form of short-term memory
and awareness, each with its own representation and time scale.

Recent  psychophysical  experiments  by  Braun  and  Sagi  (1992)  and  Braun 
(1994)  support  this  point  of  view.  The  visual  attention  of  subjects  was  "tied 
down"  to  a  particular  spot  on  a  monitor  by  asking  them  to  carry  out  a 
difficult  discrimination  experiment.  With  focal  attention  thus  distracted, 
Braun  projected  a  number  of  objects  in  an  annulus  around  the  focus  of 
attention  and  asked  subjects  to  respond  if  an  odd-man-out  was  present 
in  these  displays.  His  results  show  that  subjects  could  well  detect  a  large 
object  among  many  small  objects,  a  red  among  many  grays,  a  low-spatial 
frequency  grating  among  many  high-frequency  ones,  a  circle  among  many 
triangles,  or  a  triangle  among  many  distracting  circles.  However,  subjects 
were  frequently  unable  to  detect  a  small  object  among  large  ones,  a  gray 
object  among  many  distracting  red  ones,  etc.  In  the  absence  of  attention, 
subjects  could  even  reliably  report  the  hue  of  two  bright  spots,  one  above 
and  one  below  the  location  of  attention.  Thus,  in  this  case,  awareness  of 
these  objects  is  mediated  in  the  absence  of  focal  attention,  supporting  the 
idea  that  different  forms  of  "vivid  awareness"  might  exist,  each  one  with 
its  own  set  of  properties. 

It  is  important  that  neurobiological  theories  of  these  phenomena  do  not 
treat  short-term  memory,  visual  perception,  or  awareness  as  single,  mono¬ 
lithic  entities  with  but  a  single  neuronal  implementation.  They  may  rather 
be  the  end  product  of  a  large  number  of  highly  interactive  neuronal  mech¬ 
anisms  (Minsky  1985). 

A  SUMMARY  OF  OUR  SPECULATIONS 

To  assist  the  reader  we  now  list  very  briefly  the  various  speculative  ideas 
put  forward  in  this  chapter.  These  speculations  do  not  all  make  a  coherent 
set, though certain combinations of them do.

1. The brain constructs an explicit, multilevel, symbolic interpretation of
parts  of  its  environment. 

1.1.  To  do  this  it  usually  needs  some  form  of  attentional  mechanism. 

2.  The  form  of  awareness  associated  with  focal  attention  is  caused  by  the 
firing  of  a  temporally  coordinated  assembly  of  neurons  firing  in  some  spe¬ 
cial  manner  for  at  least  100  or  200  msec. 

2.1.  This  special  form  of  neuronal  activity  induces  short-term  memory. 

3.  If  neurons  are  not  part  of  this  transient  subset,  they  can  still  influence 
behavior  but  do  not  contribute  toward  awareness. 

3.1.  Underlying  every  direct  perception  is  a  group  of  neurons  strongly 
firing  and  participating  in  the  temporally  coordinated  neuronal  assembly. 

4.  Semisynchronous,  neuronal  oscillations  in  the  25-55  Hz  band  could 
cause  neurons  to  be  coordinated,  giving  rise  to  short-term  memory  and 
thus  to  awareness. 

5.  The  neural  correlate  of  awareness  occurs  mainly  in  the  lower  layers. 

5.1.  The  neural  correlate  of  awareness  is  associated  with  the  bursting  neu¬ 
rons  in  layer  5,  some  of  which  project  outside  the  cortical  system. 

5.2.  The  loop  between  deep  layers  in  cortex,  the  different  thalamic  nuclei, 
and  back  to  cortex  may  implement  short-term  memory. 

5.3.  The  neurons  in  the  upper  cortical  layers  are  mainly  concerned  with 
unconscious  processing. 

6.  Various  types  of  neural  connections  may  be  associated  with  some  forms 
of  visual  awareness.  Possible  examples  are: 

6.1.  Connections  to  the  hippocampal  system  and  the  higher  planning  levels 
of the motor system, direct backprojections to V1 (and possibly V2), and
reentrant connections within layer 4 or between cortical areas at the same
level in the anatomical hierarchy.


Perception as an Oneiric-like State
Modulated  by  the  Senses 

Rodolfo  R.  Llinas  and  Urs  Ribary 


An  issue  clearly  fundamental  to  the  understanding  of  central  nervous  sys¬ 
tem  function  lies  in  the  similarities  and  differences  between  wakefulness 
and  dreaming.  Indeed,  from  the  standpoint  of  the  thalamocortical  system 
it  has  been  shown  that,  as  will  be  described  in  detail  later,  these  two  states 
have  a  common  intrinsic  implementation  mechanism  and  so  may  be  con¬ 
sidered,  in  that  sense,  as  fundamentally  equivalent  (Llinas  and  Pare  1991). 
The  implications  of  such  a  hypothesis  could  be  far-reaching  if  wakefulness 
is  demonstrated  to  be,  as  is  dreaming,  a  closed  intrinsic  functional  state. 
If  this  were  the  case,  the  central  difference  between  the  two  would  be  the 
degree  of  their  modulation  by  sensory  input. 

WAKEFULNESS  AND  REM  SLEEP 

Paradoxical  sleep  is  characterized  by  the  repeated  occurrence  of  rapid  eye 
movement  (REM) — from  which  the  alternative  designation  "REM  sleep" 
was  derived — and  by  muscular  atonia.  One  of  the  most  salient  differences 
between  the  wakefulness  and  dreaming  states  resides  in  the  fact  that  sen¬ 
sory  input  does  not  generate  the  expected  cognitive  consequences  that  it 
does  in  the  awake  state.  With  respect  to  other  sleep  states,  REM  sleep  dif¬ 
fers  in  that  sensory  thresholds  for  awakening  are  the  highest  in  REM  sleep, 
except  for  stage  IV  (Rechtschaffen  et  al.  1966;  Williams  et  al.  1964),  and  that 
subjects  awakened  during  REM  sleep  often  report  having  been  dreaming. 

Of  central  interest  here  is  the  finding  that  the  averaged  evoked  poten¬ 
tials  (AEPs)  recorded  from  the  scalp  in  response  to  sensory  stimulation 
during  waking  and  REM  sleep  are  very  similar,  but  they  differ  strikingly 
from  those  recorded  during  non-REM  sleep.  For  instance,  the  early  com¬ 
ponents  of  the  auditory  evoked  potential  in  humans  (<10  msec)  (Moller 
and  Burgess  1986)  do  not  display  state-dependent  fluctuations  during  the 
sleep-waking  cycle  (Campbell  and  Bartoli  1986;  Giard  et  al.  1988;  Picton 
and  Hillyard  1974).  However,  those  middle-latency  components  (10-80 
msec) that seem to reflect early thalamocortical activity decreased in amplitude
from waking to stage IV but returned to normal (Chen and Buchwald
1986) or surpassed waking values in REM sleep (Deiber et al. 1989;
Mendel  and  Goldstein  1971;  Mendel  and  Kuperman  1974). 


Likewise,  short-,  middle-,  and  long-latency  components  may  also  be 
distinguished in somatosensory evoked potentials. Among the early components,
only the positivity at 15 msec (P15) does not display state-dependent
fluctuations (Yamada et al. 1988). The amplitude of the other
components decreases markedly from waking to stage IV but partially
recovers in REM sleep (Yamada et al. 1988). The latency of the P20
component (which presumably reflects the primary cortical response)
increases from waking to stage IV but returns close to waking values in
REM sleep.

The  Central  Paradox 

Since  the  brain's  response  to  sensory  stimulation  is  very  similar  during 
REM  sleep  and  wakefulness,  the  threshold  for  awakening  should  be  lowest 
in  REM  sleep.  As  stated  above,  however,  this  is  not  the  case  in  humans  or 
in  other  mammals  where  the  auditory  threshold  for  awakening  is  clearly 
higher  in  REM  sleep  than  in  deep  slow-wave  sleep  (Jouvet  and  Michel 
1959).  These  studies  point  to  a  central  paradox  of  REM  sleep:  stimuli  that 
are  perceived  in  the  waking  state  do  not  awaken  subjects  in  REM  sleep 
even  though  the  amplitude  of  the  primary  evoked  cortical  responses  is 
generally  similar  to,  or  higher  than,  those  in  the  waking  state.  In  other 
words,  although  the  thalamocortical  network  is  as  excitable  during  REM 
sleep  as  in  the  waking  state,  the  input  is  mostly  ignored. 

The resolution of this paradox probably lies in the nature of brain function
in a most fundamental sense. In particular, the fact that the late potentials
(P100, P200, P300) following sensory stimuli are abolished in REM
sleep  (Goff  et  al.  1966;  Velasco  et  al.  1980)  suggests  that  the  ongoing  activity 
that  generates  cognition  during  dreaming  prevents  the  early  thalamocor¬ 
tical  activation  from  being  incorporated  into  the  intrinsic  cognitive  world. 
Perhaps  then,  an  altered  state  of  attention  is  the  most  likely  origin  for  the 
high threshold for awakening from REM sleep (Llinas and Pare 1991).

Is  "Cognition"  during  REM  Sleep  Similar  to  That  in  Wakefulness? 

One  tool  available  to  study  the  functional  state  of  the  brain  during  REM 
sleep  is  a  comparison  of  the  dreams  of  control  subjects  with  those  of  patients 
suffering  from  various  central  or  peripheral  nervous  system  dysfunctions. 

The  decline  of  higher  cognitive  abilities  following  circumscribed  lesions 
of  the  temporal  and  parietal  associative  areas  is  also  reflected  in  dream  con¬ 
tent.  For  instance,  patients  afflicted  with  unilateral  neglect  resulting  from 
right  parietal  lobe  damage,  in  which  the  opposite  half  of  the  visual  field 
is  not  perceived,  report  similar  lack  of  perception  in  their  dreams  (Sacks 
1991;  M.  Mesulam,  personal  communication).  Similarly,  people  inhabiting 
the  dreams  of  prosopagnosic  subjects  are  faceless  (A.  Damasio,  personal 
communication).  Interestingly,  when  awake  these  patients  perceive  facial 
features but they cannot use such features to recognize individual faces.
These observations indicate that mentation during dreaming operates on
the  same  anatomical  substrate  as  does  perception  during  the  waking  state. 

From the fact that similar deficits are found in wakefulness and dreaming,
it may be concluded that a possible approach to understanding the
nature  of  wakefulness  is  to  consider  it  as  one  element  in  a  category  of 
intrinsic  brain  functions,  in  which  REM  sleep  is  another  element.  The  dif¬ 
ference  between  these  two  states  would  be  that  in  REM  sleep,  the  sensory 
specification  of  the  functionalities  carried  out  by  the  brain  is  fundamentally 
altered. That is, REM sleep can be considered as an intrinsic state in which
"attention" is turned away from sensory input (Llinas and Pare 1991).

In  proposing  that  wakefulness  is  nothing  other  than  a  dreamlike  state 
modulated  by  the  presence  of  specific  sensory  inputs  (Llinas  and  Pare 
1991),  the  following  must  be  considered.  The  thalamus  is  classically  re¬ 
garded  as  the  functional  and  morphological  gate  to  the  forebrain  (Steriade 
et  al.  1990).  Indeed,  with  the  exception  of  the  olfactory  system,  all  sensory 
messages  reach  the  cerebral  cortex  through  the  thalamus  (Jones  1985).  Yet, 
synapses  established  by  specific  thalamocortical  fibers  comprise  a  minor¬ 
ity  of  cortical  contacts.  For  example,  in  the  primary  somatosensory  and 
visual  cortices,  the  axons  of  ventroposterior  thalamic  and  dorsal  LGN  neu¬ 
rons  account  for,  respectively,  28%  and  20%  of  the  synapses  in  layer  IV  and 
adjacent  parts  of  layer  III  (LeVay  and  Gilbert  1976),  where  most  thalam¬ 
ocortical  axons  project.  Even  in  primary  sensory  cortical  areas,  most  of 
the  connectivity  does  not  represent  sensory  input  transmitted  by  the  thala¬ 
mus,  but  rather  input  from  cortical  and  nonthalamic  CNS  nuclei.  Indeed, 
corticostriatal, corticocortical and corticothalamic pyramidal neurons receive,
respectively, 0.3-0.9%, 1.5-6.8%, and 6.7-20% of their synapses from
specific  thalamocortical  fibers,  and  less  than  4%  of  the  synaptic  contacts 
on  multipolar  aspiny  neurons  in  layer  IV  originate  in  the  thalamus  (White 
and  Hersch  1981, 1982). 

Moreover,  the  connectivity  between  the  thalamus  and  the  cortex  is  bidi¬ 
rectional.  Indeed,  layer  6  pyramidal  cells  project  back  to  that  area  of  the 
thalamus  where  their  specific  input  arises  (Jones  1984),  and  layer  5  cells 
project  to  the  nonspecific  thalamus.  The  number  of  corticothalamic  fibers 
is  about  one  order  of  magnitude  larger  than  the  number  of  thalamocortical 
axons  (Wilson  et  al.  1984).  Looking  at  the  peripheral  input,  the  number  of 
optic  nerve  axons  projecting  to  the  LGN  is  much  smaller  than  the  number 
of  corticothalamic  axons  projecting  to  the  LGN  (Wilson  et  al.  1984). 

Clearly,  the  sensory  input  arising  from  the  thalamus  is  necessary  for 
perception;  in  the  absence  of  specific  inputs,  there  is  no  externally  guided 
sensory  function.  However,  the  specific  thalamocortical  input  accounts  for 
a  minority  of  the  synaptic  contacts  in  the  cortex. 

Let  us  briefly  discuss  the  nature  of  the  interaction  between  this  set  of 
innate  mechanisms  and  the  sensory  world.  At  the  outset,  it  must  be  rec¬ 
ognized  that  sensory  events  are  nothing  other  than  simplifications  deter¬ 
mined  by  the  physical  properties  of  our  sensory  organs.  Similarly,  the  in¬ 
ternal  representation  derived  from  the  sensory  specification  is  constrained 
by the "computational" capabilities of the brain. According to this view,
the  model  of  the  world  emerging  during  ontogeny  is  governed  by  innate 
predispositions  of  the  brain  to  categorize  and  integrate  the  sensory  world  in 
certain  ways.  Although  the  particular  computational  world  model  derived 
by  a  given  individual  is  a  function  of  his  sensory  exposure,  the  resulting 
functional  accommodation  is  genetically  determined.  As  a  result,  sensory 
inputs  presented  during  adult  life  would  convey  only  the  parameters  re¬ 
quired  to  specify  the  dimensions  relevant  to  the  cognitive  domains  which 
stemmed  from  this  evolutionary  process.  These  cognitive  domains  could 
be  used  to  recreate  world-analogues  during  dreaming  or,  once  specified  by 
sensory  inputs,  to  generate  an  adaptive  representation  of  the  environment. 

Thus,  we  may  consider  a  closely  related  problem,  that  of  the  open  (ex¬ 
trinsic)  or  closed  (intrinsic)  nature  of  nervous  system  function.  One  view 
stipulates  that  the  brain  states  that  represent  the  external  world  are  point- 
to-point  representations,  having  as  their  basic  currency  a  set  of  elaborate 
reflexes. This view may be traced back, in modern times, to William James
(1890).  An  opposite  point  of  view  is  that  the  brain  is  basically  a  recurrent  or 
closed  system.  Support  for  the  latter  proposal  comes  from  electrophysio- 
logical  studies  indicating  that  the  intrinsic  membrane  properties  of  neurons 
allow them to oscillate or resonate at different frequencies (Llinás 1988a)
and that such intrinsic activity, by supporting rhythmic oscillatory events,
may play a fundamental role in CNS function (Llinás 1988a; see chapter 10
by  Singer).  It  can  be  argued  that  the  insertion  of  such  elements  into  com¬ 
plex  synaptic  networks  allows  the  brain  to  generate  dynamic  oscillatory 
states  that  deeply  influence  the  brain  activity  evoked  by  sensory  stimuli. 

PERCEPTION  AS  GENERATED  BY  A  CLOSED  SYSTEM 

Several  factors  suggest  that  the  brain  is  essentially  a  closed  system  capable 
of  self-generated  oscillatory  activity  that  determines  the  functionality  of 
events  specified  by  the  sensory  stimuli.  First,  as  stated  above,  only  a  minor 
part  of  the  thalamocortical  connectivity  is  devoted  to  the  reception  and 
transfer  of  sensory  input.  Second,  the  number  of  cortical  fibers  projecting 
to  the  specific  thalamic  nuclei  is  much  larger  than  the  number  of  fibers 
conveying  the  sensory  information  to  the  thalamus  (Wilson  et  al.  1984). 
Thus,  a  large  part  of  the  thalamocortical  connectivity  is  organized  in  what  is 
presently  known  as  reentrant  activity  (Edelman  1987)  or  previously  viewed 
as  reverberating  activity  (Lorente  de  No  1932).  Third,  the  insertion  of 
neurons  with  intrinsic  oscillatory  capabilities  into  this  complex  synaptic 
network  allows  the  brain  to  generate  dynamic  oscillatory  states  which 
shape  the  computational  events  evoked  by  sensory  stimuli.  In  this  context, 
functional  states  such  as  wakefulness  (or  REM  sleep  and  other  sleep  stages) 
appear  to  be  particular  examples  of  the  multiple  variations  provided  by 
the  self-generated  brain  activity. 

Much  neuropsychological  evidence  also  supports  this  view  of  the  brain 
as a closed system in which sensory input plays an extraordinarily important
but, nevertheless, mainly modulatory role. The cases of prosopagnosic
patients dreaming of faceless characters indicate that the significance
of  sensory  cues  is  largely  dependent  on  their  incorporation  into  larger  cog¬ 
nitive  entities  and  on  the  functional  state  of  the  brain.  In  other  words, 
sensory  cues  gain  their  significance  by  virtue  of  triggering  a  preexisting 
disposition  of  the  brain  to  be  active  in  a  particular  way. 

That  for  the  most  part  connectivities  present  at  birth  in  humans  are  mod¬ 
ified  only  in  detail  during  normal  maturation  has  been  suspected  from  the 
inception  of  neurological  research  (Cajal  1929;  Harris  1987).  The  localiza¬ 
tion  of  function  in  the  brain  began  with  the  identification  of  a  cortical  speech 
center  by  Broca  and  was  followed  by  the  discovery  of  point-to-point  soma- 
totopic  maps  in  the  motor  and  sensory  cortices  (Penfield  and  Rasmussen 
1950),  and  in  the  thalamus  (Mountcastle  and  Hennemann  1949, 1952). 

A  totally  different  type  of  functional  geometry  (Pellionisz  and  Llinas 
1982)  suggests  the  existence  of  temporal  mapping.  This  has  been  far  more 
difficult  to  conceptualize,  since  its  study  requires  an  understanding  of  si¬ 
multaneity  in  brain  function  not  usually  considered  in  neuroscience. 

40-HZ  ACTIVITY  AND  COGNITIVE  CONJUNCTION:  THE  CASE  FOR 
TEMPORAL  MAPPING 

Synchronous  activation  has  recently  been  seen  in  the  mammalian  cerebral 
cortex.  Visual  stimulation  with  light  bars  of  optimal  dimensions,  orienta¬ 
tion,  and  velocity  may  synchronously  activate  cells  in  a  given  column  in 
the visual cortex (Eckhorn et al. 1988; Gray et al. 1989; Gray and Singer
1989). Moreover, the components of a visual stimulus that relate to a singular
cognitive object (such as a line in a visual field) produce coherent 40-Hz
oscillations  in  regions  of  the  cortex  that  may  be  separated  by  as  much  as 
7  mm  (Gray  et  al.  1989;  Gray  and  Singer  1989).  And  a  high  correlation 
coefficient  has  been  found  for  40-Hz  oscillatory  activity  between  related 
cortical  columns. 

These  findings  have  inspired  a  number  of  theoretical  papers  with  the 
view  that  temporal  mapping  is  very  important  in  nervous  system  function. 
The  central  tenet  can  be  summarized  simply.  Spatial  mapping  allows  a 
limited  number  of  possible  representations.  However,  the  addition  of  a 
second  component  (serving  to  form  transient  functional  states  by  means  of 
simultaneity)  generates  an  indefinitely  large  number  of  functional  states, 
as  the  categorization  is  accomplished  by  the  conjunction  of  spatial  and 
temporal  mapping. 
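
As a rough illustration of this counting argument, the following sketch compares the number of states available to a purely spatial code with the number available once any subset of sites can be bound by simultaneity. The site count and the function names are hypothetical, and the model (each site is simply in or out of a synchronous assembly at a given instant) is an assumption made for illustration, not a claim about how the cortex actually encodes objects.

```python
from itertools import combinations


def spatial_states(n_sites: int) -> int:
    """States of a purely spatial (place) code: one per site."""
    return n_sites


def conjunction_states(n_sites: int) -> int:
    """Non-empty subsets of sites that could be co-active at one instant."""
    return 2 ** n_sites - 1


def enumerate_assemblies(n_sites: int):
    """Explicitly list every non-empty synchronous assembly (small n only)."""
    sites = range(n_sites)
    return [c for k in range(1, n_sites + 1) for c in combinations(sites, k)]
```

With four sites the spatial code offers 4 states while the conjunction of place and simultaneity offers 15; the gap grows exponentially with the number of sites, which is the sense in which the number of functional states becomes very large under this toy model.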

Magnetoencephalographic  recordings  performed  in  awake  humans 
(Llinas  and  Ribary  1992)  revealed  the  presence  of  continuous  and  coher¬ 
ent  40-Hz  oscillations  over  the  entire  cortical  mantle.  The  presentation  of 
auditory  stimuli  produced  a  clear  resetting  of  this  40-Hz  activity.  Phase 
comparison  of  the  oscillatory  activity  recorded  from  different  cortical  re¬ 
gions  revealed  the  presence  of  a  12-  to  13-msec  phase  shift  between  the 
rostral  and  caudal  pole  of  the  brain  (Llinas  and  Ribary  1992). 

The high degree of spatial organization displayed by this 40-Hz oscillation
suggests that it may be a candidate mechanism for the production of
temporal conjunction of rhythmic activity over a large ensemble of neurons.
Furthermore, it has been shown that the sparsely spiny layer IV neurons
of the cortex (Llinás et al. 1991) are capable of 40-Hz activity (figure
6.1). The inhibitory input from these neurons would produce rebound sequences (probably
dependent  on  persistent  sodium  conductances)  in  thalamically  projecting 
pyramidal  neurons.  These  cells  would  then  generate  a  40-Hz  inhibitory 
rebound  oscillation  in  cells  of  the  reticularis  (RE)  thalamic  nucleus,  a  group 
of GABAergic neurons projecting to most relay nuclei of the thalamus (Steriade
et al. 1984). More recently, it has been demonstrated that, in addition
to oscillations due to cortical circuit properties, thalamic neurons in vivo
can also oscillate intrinsically at 40 Hz, using ionic mechanisms similar to
those of the sparsely spiny layer IV neurons (Steriade et al. 1991). Consequently,
specific corticothalamocortical pathways could be led to resonant oscillation at
40 Hz. According to this hypothesis, RE cells would be responsible for the
synchronization of the 40-Hz oscillations in distant thalamic and cortical
sites. Indeed, it has been shown that neighboring RE cells are linked by
dendrodendritic and intranuclear axon collaterals (Deschenes et al. 1985;
Yen et al. 1985).

Figure 6.1 In vitro intracellular recording from a sparsely spinous neuron of the fourth layer
of the frontal cortex of guinea pig. (A) The characteristic response obtained in the cell following
direct depolarization, consisting of a sustained subthreshold oscillatory activity on which
single spikes can be observed. The intrinsic oscillatory frequency was 42 Hz, as demonstrated
by the autocorrelogram shown in the upper right corner. (B) The same record as in (A) but at
slower sweep speed, demonstrating how the response outlasts the first stimuli but comes to
an abrupt cessation in the middle of a second stimulus. (Modified from Llinás et al. 1991)

BRAINSTEM  INFLUENCE  ON  THALAMIC  FIRING  MODE 

While  the  firing  mode  of  thalamocortical  cells  is  related  to  the  expression  of 
intrinsic  membrane  properties,  the  state-dependent  fluctuations  in  mem¬ 
brane  potential  seem  to  result  from  extrinsic  synaptic  influences.  Thus, 
during  REM  sleep  temporal  associations  that  generate  subjectivity  may 
not  coincide  with  the  temporal  maps,  and  only  strong  sensory  inputs  are 
capable  of  resetting  such  temporal  conditions.  In  short,  if  the  sensory  input 
coming  to  the  brain  is  not  put  in  the  context  of  thalamocortical  reality  by 
being  correlated  temporally  with  ongoing  activity,  the  stimulus  does  not 
exist  as  a  functionally  meaningful  event. 

If  this  is  the  case,  we  may  conclude  that  the  perception  of  external  reality 
is an intrinsic function of the CNS, developed and honed by the same evolutionary
pressures that generated other specializations. Moreover, this
implies  that  secondary  qualities  of  our  senses  such  as  colors,  identified 
smells,  tastes,  and  sounds  are  inventions  of  our  CNS  that  allow  the  brain 
to  interact  with  the  external  world  in  a  predictive  manner  (Llinas  1988a). 
The  degree  to  which  our  perception  of  reality  and  "actual"  reality  overlap 
is  inconsequential  as  long  as  the  predictive  properties  of  the  computational 
states  generated  by  the  brain  meet  the  requirements  of  successful  interac¬ 
tion  with  the  external  world. 

If  we  assume  that  the  phase  shift  observed  in  these  preliminary  studies 
is  related  to  the  presence  of  coherent  waves  that  scan  our  brain  at  40  Hz, 
we can conclude that consciousness is not a continuous event. Rather, it is
determined by the simultaneity of activity in the thalamocortical system,
modulated by the brainstem and fed by sensory input when one is awake
and by circuits that support memories when one is asleep.

THALAMOCORTICAL  RESONANCE  AS  THE  FUNCTIONAL  BASIS 
FOR  CONSCIOUSNESS 

From  the  above,  it  follows  that  the  major  development  in  the  evolution 
of  the  brain  of  higher  primates,  including  man,  is  the  enrichment  of  the 
corticothalamic  system.  This  is  supported  by  evolutionary  studies  if  one 
considers  the  increase  in  corticalization  in  mammals.  The  increase  in  the 
surface  area  of  the  neocortex  in  man  is  approximately  three  times  that  of 
higher  apes  (Lande  1979). 

How  can  this  thalamocorticothalamic  functional  state  generate  the 
unique  experience  we  all  recognize  as  existence  of  self  or  existence  of  the 
here and now? In principle, the activity generated via thalamocortical
interactions  may  mimic  the  responsiveness  generated  during  the  waking 
state  (i.e.,  reality-emulating  states,  such  as  hallucinations,  may  be  gener¬ 
ated).  The  implications  of  this  proposal  are  of  some  consequence,  for  this 
means  that  if  consciousness  is  a  product  of  thalamocortical  activity,  it  is  the 
dialogue  between  the  thalamus  and  the  cortex  that  generates  subjectivity. 

EXPERIMENTS  SUPPORTING  THE  SIMILARITIES  OF  REM  SLEEP 
AND  WAKEFULNESS 

If,  as  stated  above,  40-Hz  thalamocortical  resonance  is  responsible  for  the 
global  temporal  mapping  that  generates  cognition,  such  global  conjunction 
should  be  present  during  the  dreaming  state.  In  fact,  it  has  been  recently 
reported that 40-Hz activity occurs in an organized fashion and demonstrates
a rostrocaudal phase shift during REM sleep (Llinás and Ribary
1993).

Magnetoencephalography  (MEG)  was  utilized  in  that  study.  Three  sets 
of  studies  addressed  issues  concerning  (1)  the  presence  of  40-Hz  activity 
during  sleep,  (2)  the  possible  differences  between  40-Hz  resetting  in  dif¬ 
ferent  sleep/wakefulness  states,  and  (3)  the  question  of  40-Hz  scan  during 
REM  sleep. 

To  this  effect,  spontaneous  magnetic  activity  was  continuously  recorded 
and  filtered  at  35-45  Hz  during  wakefulness,  delta  sleep,  and  REM  sleep, 
using  a  37-channel  sensor  array.  Because  Fourier  analysis  of  the  sponta¬ 
neous,  broadly  filtered  rhythmicity  (1-200  Hz)  demonstrated  a  large  peak 
of activity at 40 Hz over much of the cortex, we feel that such filtering is
permissible.  Large  coherent  signals  with  a  very  high  signal-to-noise  ratio 
were  easily  recorded  from  all  37  sensors,  corresponding  to  activity  in  differ¬ 
ent  regions  of  the  cortex  (figure  6.2B).  This  single  0.6-sec  epoch  illustrates 
the  global  spontaneous  oscillation  in  an  awake  individual. 
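
The argument for narrow-band filtering rests on first locating a spectral peak in the broadly filtered signal. A minimal sketch of that check on a synthetic 0.6-sec epoch is given below. The sampling rate, the composition of the synthetic signal, and all function names are assumptions introduced for illustration; the sketch does not reproduce the MEG analysis used in the study.

```python
import cmath
import math
import random

FS = 1000   # assumed sampling rate in Hz (hypothetical)
N = 600     # samples in one 0.6-sec epoch at FS


def synthetic_epoch(seed: int = 0):
    """One epoch: a 40-Hz rhythm plus a weaker 10-Hz rhythm and noise."""
    rng = random.Random(seed)
    return [
        math.sin(2 * math.pi * 40 * n / FS)
        + 0.5 * math.sin(2 * math.pi * 10 * n / FS)
        + rng.gauss(0.0, 0.3)
        for n in range(N)
    ]


def power_spectrum(x, f_max=200.0):
    """Naive DFT power for every nonzero bin up to f_max (Hz)."""
    n = len(x)
    k_max = int(f_max * n / FS)
    spectrum = {}
    for k in range(1, k_max + 1):
        coeff = sum(x[i] * cmath.exp(-2j * math.pi * k * i / n)
                    for i in range(n))
        spectrum[k * FS / n] = abs(coeff) ** 2   # key is frequency in Hz
    return spectrum


def peak_frequency(x):
    """Frequency (Hz) of the largest spectral peak up to 200 Hz."""
    spec = power_spectrum(x)
    return max(spec, key=spec.get)
```

For this synthetic epoch the largest peak falls in the 40-Hz bin, which is the condition under which restricting the analysis to a 35-45 Hz band discards little of the dominant rhythm.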

A  second  set  of  experiments  examined  the  responsiveness  of  the  40- 
Hz  oscillation  to  stimuli  during  these  three  different  functional  states.  As 
shown previously, 40-Hz oscillation may be reset by sensory stimuli (Llinás
and Ribary 1993; Galambos et al. 1981; Pantev et al. 1991). This is clearly
observed  following  auditory  stimulation.  In  these  experiments,  the  audi¬ 
tory  stimulus  consisted  of  frequency-modulated  500-msec  tone  bins,  trig¬ 
gered  100  msec  after  the  onset  of  the  600-msec  recording  epoch,  randomly 
sampled  over  a  time  period  of  approximately  10  minutes.  The  stimuli  were 
delivered  to  the  subject  during  conditions  of  wakefulness  (figure  6.2C), 
delta  sleep  (figure  6.2D),  and  REM  sleep  (figure  6.2E).  In  agreement  with 
previous findings (Llinás and Ribary 1993; Galambos et al. 1981; Pantev
et al. 1991), auditory stimuli (arrowhead) delivered during wakefulness
produced well-defined 40-Hz oscillation (Ribary et al. 1991). When a similar
set of stimuli was delivered during delta sleep, no resetting was observed.
Likewise, in this or any of six other subjects in whom this experiment was
performed, resetting was not observed during REM sleep, as shown in
figure 6.2E.

Figure 6.2 Forty-Hertz oscillation in wakefulness and a lack of 40-Hz reset in delta sleep
and REM sleep. Recording using a 37-channel MEG. (A) Diagram of sensor distribution over
the head; in (B) the spontaneous magnetic recordings from the 37 sensors during wakefulness
are shown immediately below (filtered at 35-45 Hz). In (C-F) averaged oscillatory responses
(300 epochs) following auditory stimulus. In (C), the subject is awake and the stimulus is
followed by a reset of 40-Hz activity. In (D) and (E), the stimulus produced no resetting of
the rhythm. (F) The noise of the system in femtotesla (fT). (Modified from Llinás and Ribary
1993)


The  level  of  coherence  present  at  all  recording  points  was  illustrated 
by  superimposing  the  37  traces  recorded  during  a  600-msec  epoch  (figure 
6.2C).  It  is  clear  from  such  recording  that  while  there  is  coherence  among 
the  different  recording  sites  there  is  also  a  phase  shift  of  the  oscillation 
along  the  different  sites  (Llinas  and  Ribary  1992). 

These  findings  indicated  that  while  electrically  the  awake  and  REM  sleep 
states  are  similar  with  respect  to  the  presence  of  40-Hz  oscillations,  the 
central  difference  between  these  states  is  the  lack  of  sensory  reset  of  the 
REM  40-Hz  activity.  By  contrast,  during  delta  sleep,  the  amplitude  of  these 
oscillators  differs  from  that  of  wakefulness  and  REM  sleep,  but  as  in  REM 
sleep,  there  is  no  40-Hz  sensory  response. 
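
The logic of detecting a reset in averaged responses can be sketched with a toy simulation. If a stimulus forces the ongoing 40-Hz rhythm to a fixed phase, the post-stimulus oscillation survives averaging over epochs; if the phase remains random from epoch to epoch, the average tends toward zero. The epoch length, stimulus onset, and number of epochs follow the text and the figure 6.2 caption; the one-sample-per-msec grid, the unit-amplitude sinusoid, the instantaneous reset to zero phase, and the function names are assumptions made for illustration only.

```python
import math
import random

EPOCH_MS = 600    # epoch length in msec (one sample per msec, assumed)
STIM_MS = 100     # stimulus onset within the epoch
N_EPOCHS = 300    # number of epochs averaged
F0 = 40.0         # oscillation frequency in Hz


def epoch(rng, reset: bool):
    """One epoch of a unit 40-Hz rhythm with a random starting phase.

    If reset is True, the phase is forced to zero at stimulus onset.
    """
    phase0 = rng.uniform(0.0, 2.0 * math.pi)
    samples = []
    for t_ms in range(EPOCH_MS):
        if reset and t_ms >= STIM_MS:
            phase = 2.0 * math.pi * F0 * (t_ms - STIM_MS) / 1000.0
        else:
            phase = phase0 + 2.0 * math.pi * F0 * t_ms / 1000.0
        samples.append(math.sin(phase))
    return samples


def averaged_response(reset: bool, seed: int = 1):
    """Sample-by-sample mean over N_EPOCHS simulated epochs."""
    rng = random.Random(seed)
    total = [0.0] * EPOCH_MS
    for _ in range(N_EPOCHS):
        for i, v in enumerate(epoch(rng, reset)):
            total[i] += v
    return [v / N_EPOCHS for v in total]


def rms(x):
    """Root-mean-square amplitude of a trace."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def post_stimulus_rms(reset: bool) -> float:
    """RMS of the averaged trace from stimulus onset to the end of the epoch."""
    return rms(averaged_response(reset)[STIM_MS:])
```

In this sketch `post_stimulus_rms(True)` stays close to the RMS of a full-amplitude sinusoid, whereas `post_stimulus_rms(False)` is smaller by roughly the square root of the number of epochs, mirroring the contrast between the awake record and the two sleep records.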

The  findings  indicate  therefore  that  during  wakefulness  and  REM  sleep 
a  very  specific  40-Hz  thalamocortical  resonance  is  active  and  has  very  sim¬ 
ilar  global  properties.  Moreover,  while  both  states  can  generate  cognitive 
experiences,  the  recordings  indicate,  as  is  commonly  known,  that  the  ex¬ 
ternal  environment  is,  for  the  most  part,  excluded  from  the  imaging  of  the 
oneiric  states.  This  further  substantiates  a  recent  proposal  (Llinas  and  Pare 
1991)  that  the  dream  state  is  characterized  by  an  increased  attentiveness 
to  an  intrinsic  state  in  the  sense  that  external  stimuli  do  not  perturb  the 
intrinsic  activity. 

Figure 6.3 Rostrocaudal phase shift of 40-Hz during REM sleep as measured using MEG
(see also figure 6.2). The upper trace (A) shows synchronous activation in all 37 channels
during a 600-msec period. The oscillation in the left part of trace (A) has been expanded in
trace (B) to show five different recording sites over the head. The five recording sites of trace
(B) are displayed in (C) for a single epoch to demonstrate the phase shift for the different
40-Hz waves during REM sleep. The direction of the phase shift is illustrated by an arrow
above (C). The actual traces and their site of recordings for a single epoch are illustrated in
(D) for all 37 channels. fT, femtotesla. (Modified from Llinás and Ribary 1993)


In  a  third  set  of  experiments  the  issue  of  the  front-to-back  phase  shift 
of  the  40-Hz  activity  over  the  head  during  REM  sleep  was  addressed. 
Spontaneous  40-Hz  activity  during  a  single  0.6-sec  epoch  in  REM  sleep 
(figure 6.3A) and an expanded portion of this burst (figure 6.3B) show
the  well-organized  12-msec  phase  shift  for  the  40-Hz  oscillation  observed 
from  recording  sites  1  to  5,  as  illustrated  schematically  in  figure  6.3C.  The 
actual  recording  sites  are  illustrated  for  the  epoch  shown  in  A  in  figure  6.3D. 
A  similar  12-msec  phase  shift  was  also  observed  in  the  same  individual  in 
the  awake  state  with  the  exception  that,  during  REM  sleep,  the  rostrocaudal 
sweep  is  better  organized  and  more  repeatable,  probably  since  the  sweep 
is  not  continually  reset  by  incoming  sensory  stimuli. 
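
One common way to estimate a time shift between two oscillatory traces is to find the lag that maximizes their cross-correlation. The sketch below applies this to synthetic 40-Hz sinusoids. It is not the phase-comparison procedure of the study, whose details are not given here; the sampling rate, the noise-free sinusoids, and the function names are assumptions. A 12-msec delay is used in the example because a shift of exactly half a period would make the sign of the lag ambiguous for a pure sinusoid.

```python
import math

FS = 1000    # assumed sampling rate (Hz): one sample per msec
F0 = 40.0    # oscillation frequency (Hz)
N = 600      # samples in a 0.6-sec epoch


def sine_trace(delay_ms: float):
    """Unit 40-Hz sinusoid delayed by delay_ms milliseconds."""
    return [math.sin(2 * math.pi * F0 * (n / FS - delay_ms / 1000.0))
            for n in range(N)]


def estimate_lag_ms(reference, delayed):
    """Lag (msec), within one 40-Hz period, that maximizes the cross-correlation."""
    period = int(round(FS / F0))          # 25 samples per cycle
    best_lag, best_score = 0, float("-inf")
    for lag in range(period):
        score = sum(reference[i] * delayed[i + lag]
                    for i in range(N - period))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag * 1000.0 / FS
```

Calling `estimate_lag_ms(sine_trace(0), sine_trace(12))` recovers the 12-msec delay imposed on the second trace. Because the search is limited to one period, any delay is reported modulo 25 msec, which is an inherent limitation of lag estimates on narrow-band signals.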

The significant new finding here is the fact that during the period corresponding
to REM sleep (in which a subject, if awakened, reports having
been dreaming), 40-Hz oscillation similar in distribution, phase, and
amplitude to that observed during wakefulness is observed. In the five
individuals in whom these recordings were made, the overall duration of
the rostrocaudal scan, which averaged approximately 12.5 msec, corresponded
quite closely to half a 40-Hz period. This number is the same as
that  calculated  by  Kristofferson  (1984)  for  a  quantum  of  consciousness  in 
his  psychophysical  studies  in  the  auditory  system. 
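
The relation between the scan duration and the oscillation period is simple arithmetic: one cycle at 40 Hz lasts 1000/40 = 25 msec, so 12.5 msec is half a cycle, or a phase difference of 180 degrees. The helper names below are hypothetical and serve only to make the conversion explicit.

```python
def period_ms(freq_hz: float) -> float:
    """Duration of one cycle in milliseconds."""
    return 1000.0 / freq_hz


def phase_fraction(shift_ms: float, freq_hz: float) -> float:
    """Time shift expressed as a fraction of one cycle."""
    return shift_ms / period_ms(freq_hz)


def phase_degrees(shift_ms: float, freq_hz: float) -> float:
    """Time shift expressed as a phase angle in degrees."""
    return 360.0 * phase_fraction(shift_ms, freq_hz)
```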

A second significant finding relates to the fact that during the dreaming
state, 40-Hz oscillations are not reset by sensory input although clear
evoked potential responses indicate that the thalamoneocortical system is
accessible to sensory input (Llinás and Paré 1991; Steriade 1991). This we
consider  to  be  the  central  difference  between  dreaming  and  wakefulness. 
The  recordings  suggest  that  we  do  not  perceive  the  external  world  during 
REM  sleep  because  the  intrinsic  activity  of  the  nervous  system  does  not 
place  sensory  input  in  the  context  of  the  functional  state  being  generated 
by the brain at that time (Llinás and Paré 1991). That is, the dreaming
condition is a state of hyperattentiveness in which sensory input cannot
address  the  machinery  that  generates  conscious  experience.  Relating  to 
the  morphophysiological  basis  for  this  scanning  property,  a  very  attractive 
hypothesis could be that the "nonspecific" thalamic system, in particular
the intralaminar complex, may be an important part of this process. Indeed,
the intralaminar complex represents a cellular mass that projects to
the most superficial layers of all cortical areas, including primary sensory
cortices (Jones 1985), in a spatially continuous manner. The cells in this
group may also have the necessary interconnectivity to sustain a propagating
wave within the nucleus, which could generate the rostrocaudal 12.5-msec
phase shift of the 40-Hz oscillation observed at the cortical level. This
possibility is particularly attractive given that damage of the intralaminar
system results in lethargy or coma (Facon et al. 1958; Castaigne et al. 1962)
and that single neurons of this system, especially during REM sleep, fire in
bursts with a 30-40-Hz periodicity (Steriade et al. 1993), in keeping with
the macroscopic magnetic recordings observed in this study.

BINDING  BY  SPECIFIC  AND  NONSPECIFIC  40-HZ  RESONANT 
CONJUNCTIONS 

The  results  reported  above  and  other  recent  findings  indicate  that  40-Hz 
oscillation  is  present  at  many  levels  in  the  CNS.  Indeed,  such  a  property 
is  found  in  sites  as  peripheral  as  the  retina  (Ghose  and  Freeman  1992), 
and  olfactory  bulb  (Bressler  and  Freeman  1980),  in  the  thalamus,  specific 
and  nonspecific  (Steriade  et  al.  1993a),  in  the  thalamic  reticular  nucleus 
(Pinault  and  Deschenes  1992a),  and  in  the  neocortex  (Llinas  et  al.  1991). 
Moreover, it has been shown that some of the 40-Hz activity recorded in the
visual cortex is correlated with retinal 40-Hz activity (Ghose and Freeman 1992). Thus,
40-Hz  oscillation  involves  not  only  the  cortical  but  also  the  thalamocortical 
interactions.  Such  a  possibility  is  indicated  in  the  diagrams  in  figure  6.4. 
Forty-Hertz  oscillation  of  specific  thalamocortical  neurons  (Steriade  et  al. 
1991)  can  establish  (as  shown  in  figure  6.4,  left)  thalamocortical  resonance 
via fourth-layer inputs, which resonates with inhibitory interneurons at
that level (Llinas et al. 1991). Such oscillation can reenter the thalamus
via the layer 6 pyramidal cells (Steriade et al. 1990) and resonate with
both the nucleus reticularis and the specific thalamic nuclei (Pantev et
al. 1991). This view is therefore different from the binding hypothesis
proposed by Crick and Koch, who proposed cortical binding
due to the activation of specific thalamic inputs (Crick and Koch 1990a; see
also chapter 5).

Figure 6.4 Thalamocortical circuits proposed to subserve temporal binding. Diagram of
two thalamocortical systems. (Left) Specific sensory or motor nuclei project to layer 4 of
the cortex, producing cortical oscillation by direct activation and feedforward inhibition via
40-Hz inhibitory interneurons. Collaterals of these projections produce thalamic feedback
inhibition via the nucleus reticularis. The return pathway (circular arrow on the left) reenters
this oscillation to specific and reticularis thalamic nuclei via layer 6 pyramidal cells. (Right)
Second loop shows nonspecific intralaminar nuclei projecting to the most superficial layer
of the cortex and giving collaterals to the reticular nucleus. Layer 5 pyramidal cells return
oscillation to the reticular and the nonspecific thalamic nuclei, establishing a second resonant
loop. The conjunction of the specific and nonspecific loops is proposed to generate temporal
binding. (Modified from Llinás and Ribary 1993)

On the other hand, a second system (figure 6.4, right) is represented
by the intralaminar cortical input to layer 1 of the cortex and its return-pathway
projection via fifth and sixth layer pyramidal systems to the intralaminar
nucleus, directly and indirectly, via collaterals to the nucleus
reticularis (Jones 1985). The cells in this system have been shown to oscillate
in 40-Hz bursts (Steriade et al. 1993a), and to be organized in space as
a  toroidal  mass  having  the  possibility  of  recursive  activation  (Krieg  1966), 
which  could  result  in  the  recurrent  activity  ultimately  responsible  for  the 
rostrocaudal  cortical  activation  found  in  the  present  MEG  recordings. 

Finally,  it  is  also  evident  from  the  literature  that  neither  of  these  two 
circuits  alone  can  generate  cognition.  Indeed,  as  stated  above,  damage 
of  the  nonspecific  thalamus  produces  deep  disturbances  of  consciousness 
while  damage  of  specific  systems  produces  loss  of  the  particular  modality. 

From  the  above,  a  very  tentative  hypothesis  may  be  proposed  relating  to 
the  overall  organization  of  brain  function  in  very  gross  and  oversimplified 
terms.  Indeed,  the  "specific"  thalamocortical  system  (to  be  understood  not 
only  as  that  relating  to  the  primary  sensory  modalities  but  rather  to  nuclei 
that  project  mainly,  if  not  exclusively,  to  layer  4  in  the  cortex,  whether 
sensorimotor  or  associative)  is  viewed  as  encoding  specific  sensory  and 
motor  "information"  by  the  resonant  thalamocortical  system  specialized 
to  receive  such  inputs  (e.g.,  the  LGN  and  visual  cortex). 

If this were the case, and optimal activation of any such loop tended
to produce oscillation at close to 40 Hz, activity in the "specific" thalamocortical
system could then be easily "recognized" over the cortex by this oscillatory
characteristic. In such a scheme, then, cortical sites "peaking" at
40 Hz would represent the different components of the cognitive world
that  have  reached  optimal  activity  at  that  point  in  time.  The  problem  now 
would  be  that  of  the  conjunction  of  such  a  fractured  description  into  a 
single  cognitive  event.  This  could  be  done,  we  propose,  by  the  summation 
of  40-Hz  activity  along  the  radial  dendritic  axis  of  the  cortical  elements, 
which  would  occur  when  the  specific  and  nonspecific  40-Hz  activity  is 
superposed  in  time,  and  on  the  same  set  of  neurons. 

In  short,  the  system  would  work  by  bringing  central  neurons  to  opti¬ 
mal  firing  patterns  via  dendritic  integrations  based  on  passive  and  active 
dendritic  conduction  along  the  apical  dendritic  core  conductors.  In  this 
manner, the time-coherent activity of the specific and nonspecific oscillatory
inputs, by summing distal and proximal activity in given dendritic
elements,  would  serve  to  enhance  de  facto  40-Hz  cortical  coherence  by  their 
multimodal  character,  and  serve  as  one  mechanism  for  global  binding. 
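
The dependence of such summation on temporal coincidence can be illustrated with a deliberately crude model: two unit-amplitude 40-Hz inputs, one standing for the specific loop and one for the nonspecific loop, are added linearly, and the peak of the sum is compared with a firing threshold. Linear summation, equal amplitudes, the threshold value, and the function names are all assumptions made for illustration; real dendritic integration involves the passive and active conduction properties mentioned above and is not captured here.

```python
import math

F0 = 40.0   # Hz


def summed_peak(phase_offset_rad: float, n_samples: int = 1000) -> float:
    """Peak of the sum of two unit 40-Hz inputs over one cycle.

    phase_offset_rad is the relative phase of the second input.
    """
    period_s = 1.0 / F0
    peak = 0.0
    for i in range(n_samples):
        t = period_s * i / n_samples
        angle = 2 * math.pi * F0 * t
        s = math.sin(angle) + math.sin(angle + phase_offset_rad)
        peak = max(peak, s)
    return peak


def crosses_threshold(phase_offset_rad: float, threshold: float = 1.5) -> bool:
    """True if the summed input exceeds a hypothetical firing threshold."""
    return summed_peak(phase_offset_rad) > threshold
```

When the two inputs are in phase the summed peak is twice that of either input and the hypothetical threshold is crossed; with a quarter-cycle offset the peak falls to about 1.41 and the threshold is not reached, and in antiphase the inputs cancel. In this simplified picture, only elements receiving time-coherent specific and nonspecific input would be driven to fire.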

In  this  manner  the  specific  system  would  provide  the  content,  and  the 
nonspecific  system  the  temporal  conjunction  of  such  content,  into  a  single 
cognitive  experience. 

SO,  WHY  DO  WE  DREAM? 

At  this  time  nothing  other  than  hypotheses  can  be  offered  with  respect  to 
the  final  physiological  role  of  dreaming  in  brain  function.  An  excellent 
summary of these different points of view has been published recently
(Hobson 1988).

The categories discussed most often today relate to either (1) the Freudian
view  concerning  a  subconscious  drive  or  (2)  the  "mesencephalic"  origin,  in 
which dreaming is considered as the forebrain interpretation of otherwise
meaningless  "brainstem  noise."  We  have  a  different  view  that  is  based  on 
the  fact  that  more  often  than  not  one  either  dreams  about  recent  events  or 
about  ongoing  problems.  On  this  basis  one  may  consider  that  dreaming 
may  be  the  necessary  consequence  of  the  parallel  nature  of  the  neuronal 
organization  in  the  CNS.  So,  given  a  particular  question  to  be  resolved,  the 
CNS  generally  embarks  on  simultaneous  but  diverse  possible  solutions  to 
such  problems,  in  a  parallel  fashion.  Given  this  strategy,  chances  are  that 
a  given  solution  is  arrived  at  before  other  alternatives. 

This,  however,  does  not  mean  that  the  alternatives  are  not  considered 
further.  In  fact,  it  often  happens  that  having  come  to  a  solution  considered 
adequate,  a  second  may  "pop  up"  in  one's  mind  at  a  later  time.  We  may 
further  consider  that  at  the  end  of  the  day  we  may  have  many  such  partial 
computations  being  performed  prior  to  our  falling  asleep.  The  possibility 
is  there  that  in  dreaming  we  "download"  the  other  possible  solutions  and 
thus  prevent  the  overloading  of  circuits  with  the  accumulation  of  an  ever- 
increasing  set  of  ongoing  partial  solutions  as  new  problems  are  considered. 
This  particular  point  of  view  may  be  supported  in  part  by  the  fact  that 
excellent  solutions  to  problems  may  arise  in  dreams. 

Something  quite  similar  may  be  said  with  regard  to  slow-wave  sleep. 
In  this  case  the  very  slow  oscillatory  nature  of  the  neuronal  rhythmicity 
observed  during  this  functional  state  (Steriade  et  al.  1993c)  is  probably 
closer  to  the  grooming  functions  that  most  animals  perform  upon  finishing 
a task, whether it is the smoothing of ruffled feathers after flight or the
grooming of vibrissae following ingestion of food. What all of this may have
in  common  is  that  the  time  spent  to  generate  repeating  and  rather  simple 
movements  (i.e.,  grooming)  or  repeating  and  rather  simple  patterns  of 
brain  activity  facilitates  the  return  of  neuronal  circuits  to  a  readiness  state 
that  follows  the  end  of  grooming  of  peripheral  receptors  and  effectors. 

The  use  of  such  metaphors  is  but  a  first  approximation  to  what  may 
ultimately  be  shown  to  be  the  real  function  of  sleeping  and  dreaming. 
However,  when  one  considers  the  fact  that  sleeping  is  of  such  importance 
that  no  higher  nervous  system  has  evolved  away  from  this  time-consuming 
activity,  we  must  assume  that  its  presence  is  vital  to  normal  brain  function. 

As is abundantly clear from the other chapters of this book, there are many
levels  at  which  one  can  attack  the  problem  of  modeling  the  computations 
of  the  cortex.  For  example,  at  one  extreme,  one  can  model  how  the  action 
potentials  received  at  each  synapse  are  combined  in  the  dendritic  tree,  or, 
at  the  other,  one  can  develop  a  functional  theory  of  the  different  cortical 
areas.  But,  in  addition  to  choosing  a  level,  modeling  requires  you  to  choose 
some  description  for  the  class  of  problems  that  you  expect  the  cortex  is  solv¬ 
ing,  or  the  class  of  signals  that  you  expect  the  cortex  to  be  processing.  Folk 
psychology  provided  the  labels  for  the  original  cortical  area  theory  of  Gall, 
and cognitive psychology continues to provide a more sophisticated framework
for assigning task and function labels to cortical areas (cf. Luria 1962;
Fodor  1983;  Kosslyn  and  Koenig  1992).  Neurologists  use  the  results  of  a 
limited  battery  of  tests,  supplemented  by  their  own  ability  to  empathize 
with  the  mental  state  of  their  patients,  as  the  evidence  to  be  correlated  with 
the  nature  of  the  brain  damage.  For  several  decades,  visual  neurophys¬ 
iologists  have  relied  on  the  presentation  of  moving  edges  and  bars  and 
sine  wave  gratings:  the  implicit  assumption  is  that  distinctive  patterns  of 
response  to  these  embody  the  basic  elements  of  low  level  visual  processing. 

The  point  of  departure  of  this  chapter  is  the  proposition  that  the  com¬ 
putational  analysis  of  vision — and  speech,  tactile  sensing,  motor  control, 
etc. — (the  theory  of  the  computation  as  Marr  called  it  [Marr  1982])  is  reach¬ 
ing  a  point  where  it  can  provide  a  clearer  and  deeper  description  of  the 
essential  tasks  of  vision  as  well  as  a  wide  range  of  other  cognitive  tasks. 
For  instance,  the  development  of  algorithms  for  character  recognition  or 
for  face  recognition  or  for  road  tracking  from  a  moving  vehicle  (three  prob¬ 
lems  that  have  been  much  studied  on  account  of  their  potential  applica¬ 
tions)  forces  the  researcher  to  deal  with  noisy,  complex  real  world  data.  In 
doing  this,  one's  initial  ideas  about  what  parts  of  the  problem  are  difficult, 
what  parts  are  simple,  may  turn  out  to  be  quite  wrong.  Quite  often,  a 
step  that  one  thinks  of  as  a  simple  preprocessing  clean  up  operation  turns 
out  to  be  very  difficult  and  pinpoints  for  you  a  new  class  of  problems  that 
had  been  ignored.  Introspection  turns  out  often  to  be  a  very  poor  guide  to  the 
complexity  of  a  problem.  The  reason  for  this,  we  believe,  is  our  subjective 
impression of perceiving instantaneously and effortlessly the significance
of sensory patterns (e.g., the word being spoken or which face is being
seen).  Many  psychological  experiments,  however,  have  shown  that  what 
we  perceive  is  not  the  true  sensory  signal,  but  a  rational  reconstruction  of 
what  the  signal  should  be.  This  means  that  the  messy  ambiguous  raw  sig¬ 
nal  never  makes  it  to  our  consciousness  but  gets  overlaid  with  a  clearly  and 
precisely  patterned  version  that  could  never  have  been  computed  without 
the  extensive  use  of  memories,  expectations,  and  logic.  Only  when  you  at¬ 
tempt  to  duplicate  such  a  skill  by  computer  do  you  discover  all  the  hidden 
complexity  in  the  computation. 

We  believe  that  this  analysis,  which  we  call  "Pattern  Theory"  (a  term 
introduced  in  the  pioneering  work  of  Grenander  some  15  years  ago),  leads 
not  merely  to  a  few  broad  guidelines  on  the  problems  faced  by  a  brain,  but 
to  a  rather  specific  set  of  computational  tasks,  and  to  a  flow  chart  of  how 
the  pieces  should  be  put  together.  This  analysis  is  very  different  from  most 
of  the  orthodox  analyses  of  cognitive  problems:  it  is  very  distinct  from 
the  standard  AI  view,  which  takes  formal  logic  and  the  formal  linguists' 
analysis of language into atomic units and airtight rules, as the universal
language  of  cognition.  As  we  shall  see,  it  fits  naturally,  instead,  with  such 
nonlogical  data  structures  as  probabilities,  fuzzy  sets,  and  population  cod¬ 
ing.  Moreover,  it  is  very  distinct  from  the  pure  feedforward  analyses  such 
as  Marr's  analysis  of  vision  (Marr  1982),  in  that  it  is  based  in  an  essential 
way  on  a  relaxation  between  feedforward  and  feedback  processes.  Hav¬ 
ing  this  analysis,  we  can  go  directly  to  neuroanatomy  and  neurophysiology 
and  ask  if  there  are  structures  in  the  brain  that  suggest  being  designed  to 
implement  one  or  more  of  these  basic  computational  building  blocks.  If 
these  computations  do  indeed  represent  fundamental  cognitive  operations, 
one  hopes  that  the  basic  circuitry  is  not  hidden,  but  clearly  expressed  in 
the  anatomy  of  the  cortex,  especially  in  its  layers,  pathways,  and  cell  types. 
The  method  to  follow,  we  believe,  is  to  seek  the  simplest  mechanisms  com¬ 
patible  with  present  knowledge  of  the  anatomy  and  physiology  of  cortex, 
seeking  direct  analogies  between  the  computational  architecture  and  the 
neural  architecture. 

In  the  next  section,  we  outline  the  ideas  of  Pattern  Theory  and  introduce 
three  basic  ideas  of  this  theory.  There  follow  sections  in  which  each  of 
these  ideas  is  detailed  and  its  connections  with  neuroanatomy  and  neuro¬ 
physiology  are  described.  We  suggest,  where  possible,  the  most  specific 
predictions  these  theories  make  and  propose  experimental  tests  in  sev¬ 
eral  cases.  The  biological  ideas  in  this  paper  are  developments  of  those 
described  in  our  earlier  two-part  paper  (Mumford  1991,  1992).  The  for¬ 
malism  of  Pattern  Theory  presented  here  is  developed  at  greater  length  in 
Mumford  (1993). 

WHAT  IS  PATTERN  THEORY? 

The  starting  point  of  Pattern  Theory  is  the  idea  that  sensory  signals  are 
coded  versions  of  what  is  really  going  on  in  the  world,  and  that  the  task 
of sensory information processing is to reconstruct as much as possible a
full  description  of  the  state  of  the  world.  We  may  define  the  goals  of  the 
field  as 

the  analysis  of  the  patterns  generated  by  the  world  in  any  modality,  with 
all  their  naturally  occurring  complexity  and  ambiguity,  with  the  goal  of 
reconstructing  the  processes,  objects,  and  events  that  produced  them. 

For  example,  these  patterns  may  be  those  of  visual  signals,  that  is,  2D 
arrays  of  intensity  and  color  measurements  as  received  by  the  rods  and 
cones  in  the  retina.  Or  they  may  be  the  patterns  of  auditory  signals,  that 
is,  the  time-varying  vibration  patterns  of  the  inner  hair  cells  generated 
by  the  complex  cochlear  filter.  In  the  visual  example,  one  seeks  first  to 
reconstruct  the  pattern  of  discrete  objects  in  the  world,  their  distances 
from  the  observer,  surface  markings,  and  how  they  are  illuminated  so 
as  to  produce  the  observed  signal.  In  the  case  of  speech,  the  first  step  is 
to  reconstruct  the  events  in  the  throat  and  mouth  of  the  speaker  and  then 
to  label  these  as  the  events  associated  to  specific  phonemes  in  a  specific 
language,  plus  pitch  and  stress  data  to  be  used  in  further  processing. 

But  Pattern  Theory  goes  further  and  asserts  that  a  parallel  analysis  can  be 
applied  to  higher  cognitive  levels  as  well.  Consider  a  medical  expert  sys¬ 
tem — or  a  physician.  Both  of  these  educated  devices  accept  as  input  a  des¬ 
cription  of  the  symptoms,  test  results,  and  a  partial  history  of  a  specific  pa¬ 
tient.  This  data  can  be  viewed  as  a  coded  signal  generated  by  the  processes 
at work in the patient's body. The task of the medical expert system or the
physician is to reconstruct a full description of these hidden processes. Many
cognitive  tasks  can  be  analyzed  in  this  way.  The  world  contains  unknown 
processes, objects, and events (hidden random variables in the language
of the probabilist). But they are not totally hidden, as partial encoded information
about them comes to the observer through various sensory channels
or  lower  level  analyses.  The  goal  is  to  estimate  the  world  variables. 
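
The estimation of hidden world variables from a coded signal can be sketched, for the medical example, as a direct application of Bayes' rule. What follows is a minimal illustration only: the disease names, symptom names, and all probabilities are hypothetical, and the assumption that symptoms are conditionally independent given the disease is ours, made for simplicity.

```python
def posterior(prior, likelihood, observed):
    """Posterior over a discrete hidden variable given binary observations.

    prior:      {state: P(state)}
    likelihood: {state: {symptom: P(symptom present | state)}}
    observed:   {symptom: True or False}
    Symptoms are assumed conditionally independent given the state.
    """
    unnormalized = {}
    for state, p in prior.items():
        for symptom, present in observed.items():
            q = likelihood[state][symptom]
            p *= q if present else (1.0 - q)
        unnormalized[state] = p
    total = sum(unnormalized.values())
    return {state: p / total for state, p in unnormalized.items()}


# Hypothetical numbers, chosen only for illustration.
prior = {"flu": 0.10, "cold": 0.30, "healthy": 0.60}
likelihood = {
    "flu":     {"fever": 0.90, "cough": 0.80},
    "cold":    {"fever": 0.20, "cough": 0.70},
    "healthy": {"fever": 0.01, "cough": 0.05},
}

post = posterior(prior, likelihood, {"fever": True, "cough": True})
best = max(post, key=post.get)  # the most probable hidden state
```

Here the prior plays the role of what has been learned about the world, and the likelihood table is a crude model of how the hidden state is encoded in the observed signal.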

How  does  Pattern  Theory  propose  to  carry  out  this  reconstruction?  There 
are  three  characteristic  ideas  in  Pattern  Theory.  The  first  idea  is  that  to  suc¬ 
cessfully  reconstruct  the  world  variables,  one  must  learn  to  synthesize  the 
coded  signals  that  one  observes,  so  that  tentative  reconstructions  of  the 
world  variables  can  be  checked  by  comparing  the  actual  observed  signal 
with  synthesized  signals.  This  means  that  the  architecture  is  not  purely 
feedforward, bottom-up, but fundamentally recursive, combining feedforward
actions with feedback, top-down processing with bottom-up. The
second  idea  is  that  the  encoding  processes,  which  transform  the  state  of 
the  world  into  the  received  sensory  signal,  are  not  completely  arbitrary 
(e.g.,  the  logician's  general  recursive  functions),  but  processes  of  several 
restricted sorts (deformations is Grenander's word) that recur in all sensory
channels and in higher cognitive problems. This means that the architecture
can be customized to decode these specific types of deformations to
reconstruct  the  state  of  the  world.  The  third  idea  is  that  this  reconstruction 
can  (and  must)  be  learned  from  experience,  that  one  learns  both  which 
hidden variables best describe the patterns in the signals, hence the world
itself,  and  the  priors  on  these  variables  to  be  able  to  best  compute  them.  In 
the  rest  of  this  chapter,  we  want  to  discuss  these  three  ideas. 

THE  ANALYSIS-SYNTHESIS  LOOP  AND  CORTICAL  FEEDBACK 
PATHWAYS 

Two  Different  Flow  Charts 

The  first  basic  idea  of  Pattern  Theory  is  that  to  analyze  some  class  of  signals, 
you  must  learn  to  synthesize  these  signals  given  typical  values  of  the  world 
variables.  To  recognize  some  class  of  objects  visually,  you  must  know  how 
to  synthesize  images  of  them;  to  recognize  words,  you  must  know  how  to 
synthesize  the  actual  sound  patterns;  to  diagnose  a  disease,  you  must  be 
able  to  describe  its  typical  presenting  symptoms. 

Although  this  sounds  like  common  sense,  it  distinguishes  Pattern  The¬ 
ory  from  the  majority  of  computational  and  modeling  theories,  because  it 
implies  that  top-down  feedback  processes  are  just  as  important  as  bottom- 
up  feedforward  processes.  Consider  how  many  classification  algorithms 
are  purely  feedforward:  feature-based  winner-take-all  ("Pandemonium") 
algorithms,  feedforward  neural  nets  (even  with  backpropagation,  in  which 
feedback  is  used  for  learning,  but  not  in  practice),  tree-based  classifiers  like 
CART,  and  parametric  statistical  modeling.  None  of  these  handles  grace¬ 
fully  a  new  and  unexpected  stimulus,  because  they  have  not  explicitly  mod¬ 
eled  the  stimuli  they  have  been  trained  on,  and  therefore  cannot  recognize 
novelty.  At  best,  they  can  incorporate  significance  levels,  and  flag  suspi¬ 
cious  stimuli  if  none  of  their  categories  fits  with  overwhelming  significance. 
Unfortunately,  this  often  miscarries  with  borderline  cases.  One  reason  is 
that,  because  of  the  distortions  caused  by  "interruptions"  (i.e.,  overlapping 
objects, events, or processes; see below), correct instances of a category are
often  present  but  with  part  of  their  characteristic  pattern  missing  (e.g.,  a 
letter  partially  covered  by  an  ink  blot).  In  this  case,  part  of  the  stimulus 
will  fit  the  category  very  well,  part  not  at  all,  and  a  feedforward  classifier 
may mistake it for a different category. In contrast, incorrect instances,
like a letter from a foreign alphabet, may roughly resemble one of the expected
categories, say an English letter, and therefore be mistaken for it by a
feedforward classifier. The moral is that it is much more significant for a part
of  the  stimulus  to  match  closely  the  prototype  of  a  category,  than  for  all 
of  it  to  match  slightly.  This  kind  of  distinction  cannot  be  made  unless  a 
top-down  synthesis  stage  is  part  of  the  recognition  algorithm. 
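
The moral can be made concrete with a toy numerical sketch. The signals, the tolerance, and both scoring functions below are hypothetical choices of ours, not part of any published algorithm: a global sum-of-squares score stands in for a feedforward classifier, and the fraction of samples that match the template closely stands in for the comparison that a synthesis stage makes possible.

```python
def sum_squared_error(signal, template):
    """Global mismatch: every sample contributes, however badly it fits."""
    return sum((s - t) ** 2 for s, t in zip(signal, template))


def close_fraction(signal, template, tol=0.5):
    """Fraction of samples that match the template to within tol."""
    hits = sum(1 for s, t in zip(signal, template) if abs(s - t) <= tol)
    return hits / len(template)


# A hypothetical 1D "letter" template.
template = [0, 1, 2, 3, 4, 3, 2, 1]

# A true instance whose second half is covered by an "ink blot".
occluded = [0, 1, 2, 3, 9, 9, 9, 9]

# A "foreign letter": nowhere exact, but everywhere roughly similar.
foreign = [t + 1.5 for t in template]

# The global score prefers the foreign letter (smaller error) ...
sse_occluded = sum_squared_error(occluded, template)
sse_foreign = sum_squared_error(foreign, template)

# ... while the partial-match score prefers the occluded true instance.
frac_occluded = close_fraction(occluded, template)
frac_foreign = close_fraction(foreign, template)
```

In this sketch the global score ranks the foreign letter as the better fit, whereas the partial-match score finds that half of the occluded letter fits closely and none of the foreign letter does.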

The  simplest  type  of  pattern  synthesis  consists  in  simply  storing  proto¬ 
types  or  templates  for  each  category  to  be  recognized.  Note  that  this  is 
not  the  same  thing  as  storing  prototype  feature  vectors  (e.g.,  mean  values 
of  the  features  for  all  instances  of  signals  from  a  given  category).  This 
is  because  there  is  usually  no  way  to  reconstruct  the  signal  itself  from  its 
features. In contrast, a template (as the word is used in traditional pattern
recognition) is a particular signal that can be directly compared with
the  incoming  signal.  Such  templates  are  also  incorporated  in  the  pattern 
completion  operation  of  various  neural  nets  such  as  Kohonen's  and  in  the 
seeking of "energy minima" in the attractive neural nets of Hopfield. In a
simple  world  such  templates  might  suffice  but,  because  the  many  different 
signals  belonging  to  a  single  category  (e.g.,  all  varieties  of  the  letter  A)  differ 
by  complex  transformations  such  as  domain  warping  (see  below),  a  single 
template  will  rarely  match  the  actual  signal  at  all  well.  Too  many  factors 
affect every real-world stimulus for a simple Sears-Roebuck catalog of patterns
to be useful. Each instance of a category can be positively identified
only  by  actively  synthesizing  it:  combining  the  templates  of  those  objects 
or  processes  present  on  all  scales,  distorting  them  in  the  correct  ways,  and 
removing  parts  that  are  absent.  This  is  why  Pattern  Theory  presupposes  an 
analysis-synthesis loop in which feature extraction and feedforward-style
classification are combined with a feedback step in which the system attempts
to duplicate the stimulus by combining and transforming its basic
prototypes.
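
The distinction drawn above between a prototype feature vector and a template can be illustrated with a small sketch. The choice of mean and variance as features is ours and purely illustrative; the point is only that two different signals can share one feature vector, so the signal cannot be recovered from the features, whereas a template is compared with the signal sample by sample.

```python
def features(signal):
    """A hypothetical two-number feature vector: (mean, variance)."""
    n = len(signal)
    mean = sum(signal) / n
    variance = sum((x - mean) ** 2 for x in signal) / n
    return (mean, variance)


def template_distance(signal, template):
    """Direct sample-by-sample comparison with a stored template."""
    return sum(abs(s - t) for s, t in zip(signal, template))


ramp_up = [0, 1, 2, 3, 4]
ramp_down = ramp_up[::-1]  # a different signal with the same statistics

same_features = features(ramp_up) == features(ramp_down)
distance = template_distance(ramp_down, ramp_up)
```

The two ramps are indistinguishable to the feature vector but clearly distinguished by the template comparison.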

Figure  7.1  contrasts  the  flow  charts  of  traditional  bottom-up  recognition 
systems  with  that  of  Pattern  Theory.  Note  that  Pattern  Theory  proposes 
that  analysis  and  synthesis  should  be  carried  out  iteratively.  Thus,  at  the 
first  stage,  if  there  is  no  expected  pattern,  the  features  of  the  actual  signal 
are extracted exactly as in the traditional flow chart and passed to a recognizer.
Next, however, the recognizer draws on its database of prototypes to
synthesize  a  standard  instantiation  of  the  hypothetical  object  being  seen.  In 
subsequent  iterations,  the  hypothesis  will  be  refined:  details  on  size,  orien¬ 
tation,  shading  if  present,  and  missing  and/or  extra  parts  will  be  computed 
by  comparing  the  synthesized  image  with  the  true  image  and  computing 
features  of  the  residual  or  difference  between  these.  That  does  not  mean 
that  the  true  image  is  thrown  away.  But  a  steady  state  would  mean  that  the 
synthesized  image  agrees,  up  to  acceptable  error,  with  the  true  image  and 
the  features  of  the  residual  are  too  small  to  modify  the  hypothesis  further. 
There  is  no  need  to  send  any  more  feedforward  signals  when  the  feedback 
pathway  already  predicts  the  input  signal.  (This  is  like  driving  home  on 
a  well-known  road  and  not  needing  to  pay  attention  to  anything  that  you 
see  because  it  always  agrees  with  what  you  expect,  hence  never  generates 
a  residual.) 
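
The iterative loop just described can be sketched in a few lines. This is a deliberately simplified illustration under our own assumptions, not the architecture proposed in this chapter: the signal is one-dimensional, the only deformations allowed are a gain and an offset applied to a single stored prototype, and the "features of the residual" are its mean and its projection onto the centered prototype. All function and variable names are hypothetical.

```python
def synthesize(prototype, gain, offset):
    """Top-down step: generate a signal from the current hypothesis."""
    return [gain * p + offset for p in prototype]


def analysis_synthesis(observed, prototype, tol=1e-6, max_iters=20):
    """Refine (gain, offset) until the synthesized signal predicts the input.

    Assumes the prototype is not constant, so that its centered
    energy is nonzero.
    """
    gain, offset = 1.0, 0.0  # start from the standard instantiation
    mean_p = sum(prototype) / len(prototype)
    centered = [p - mean_p for p in prototype]
    energy = sum(c * c for c in centered)

    for iteration in range(max_iters):
        predicted = synthesize(prototype, gain, offset)
        residual = [o - s for o, s in zip(observed, predicted)]

        # Steady state: the feedback already predicts the input.
        if max(abs(r) for r in residual) <= tol:
            return gain, offset, iteration

        # Bottom-up step: features of the residual update the hypothesis.
        offset += sum(residual) / len(residual)
        gain += sum(r * c for r, c in zip(residual, centered)) / energy

    return gain, offset, max_iters


prototype = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0]
observed = synthesize(prototype, 2.5, -1.0)  # a deformed instance
gain, offset, iterations = analysis_synthesis(observed, prototype)
```

Once the residual falls below the tolerance the loop stops sending corrections, which corresponds to the steady state in which the synthesized signal agrees with the true signal up to acceptable error.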

What  is  an  acceptable  error  in  synthesizing  the  signal  is  something  that 
must  also  be  modeled  explicitly  and  differently  for  each  category  of  signal. 
Thus  modeling  the  detailed  contour  of  the  nose  is  quite  significant  for  face 
recognition,  but  modeling  the  shape  of  a  stapler  is  not  significant  when 
performing  office  tasks.  Modeling  the  details  of  the  grain  of  an  oak  floor 
is  not  significant,  but  the  exact  shape  of  the  stripes  or  spots  on  the  back 
of  a  large  member  of  the  cat  family  is.  This  is  a  major  difference  between 
Pattern  Theory  and  Barlow's  theory  (see  chapter  1).  In  Barlow's  theory, 
modeling  patterns  allows  you  to  distinguish  that  part  of  the  signal  that  is 
familiar and has predictable structure from the novel information in the
signal, which resembles noise. Pattern Theory, however, distinguishes
two parts to this "information": the high-level description from which the
signal is being synthesized and the residual error that is hard or impossible
to model. The former is truly informative and is passed on to higher levels,
and the latter is discarded as being truly noise.

Figure 7.1 (Top) The traditional bottom-up approach to recognition, in which a feature vector
is computed first and this is compared with prototype vectors, one for each category. (Bottom)
The alternative proposed by Pattern Theory, in which a bottom-up/top-down relaxation explicitly
models the image by comparing it with images synthesized from high-level descriptions.

Note  that  the  flow  chart  of  Pattern  Theory  is  also  different  from  that  pro¬ 
posed  by  Poggio  (e.g.,  in  chapter  8  by  Poggio  and  Hurlbert).  They  propose 
a  very  specific  mechanism  for  combining  multiple  instances  of  a  specific 
category  by  comparing  each  with  the  true  signal  and  interpolating.  But  this 
comparison  is  feedforward  and  is  hard-wired  by  radial  basis  functions, 
so  that  if  further  kinds  of  variability  are  encountered,  one  must  multiply 
the  sets  of  stored  instances,  allowing  for  all  combinations  of  each  type  of 
variability. In contrast, Pattern Theory uses feedback, so it can synthesize dynamically
every new signal and thus potentially model a much larger class
of  deformations.  How  this  can  be  done  neurally  will  be  discussed  below. 

This feedback stage is not unlike mental imagery, which, as Kosslyn has
discovered,  is  a  complex  synthesis  and  reconstruction  of  something  that  has 
all  the  qualities  of  actual  stimuli  from  the  external  world.  As  he  suggests 
and  both  MRI  and  PET  scans  seem  now  to  confirm,  this  something  may  be 
low-level activity in the sensory areas of the brain, even V1, just like what
we  propose  for  our  feedback  (see  Le  Bihan  et  al.  1992;  Kosslyn  et  al.  1993). 
We may summarize our argument by saying:

Feedback = Synthesis of signals from memory = Use of (flexible) templates = Mental imagery.

To  give  these  ideas  a  more  concrete  flavor,  we  want  to  take  a  particular 
image:  the  old  man  on  a  bench  shown  in  figure  7.2a.  We  assume  that  you 
instantly  recognize  the  content  of  the  image.  But  how  did  you  do  this?  A 
blow-up of his face (at the same resolution) is shown in figure 7.2c: his
ear  is  the  only  vaguely  recognizable  part  of  his  face  and  his  hand  blends 
into  his  face,  creating  the  two  utterly  misleading  spots  of  light  where  you 
see  past  his  face.  Figure  7.2b  shows  what  a  state-of-the-art  edge  detector 
(Canny's)  produces  (such  detectors  require  various  parameters  to  be  set 
by  the  user  and  we  have  selected  those  that  seemed  more  or  less  optimal): 
not  only  are  the  edges  of  his  face  not  found,  but  even  the  outline  of  his  coat 
is  fragmented.  Finally,  note  that  the  most  salient  "object"  in  the  image  is 
his  cap,  which,  by  itself,  could  be  virtually  anything.  How  do  feedback 
loops  help  you  analyze  this  man?  There  are  two  stages  here:  in  the  low- 
level  feedback  loops,  low-level  templates  and  low-level  segmentation  (= 
clustering  into  distinct  objects)  take  place,  while  in  the  high-level  feedback 
loops,  models  of  objects  such  as  bodies,  heads,  and  benches  are  fit  to  the 
image.  To  make  this  plausible,  let  me  point  out  how  much  could,  in  prin¬ 
ciple,  be  done  in  low-level  fitting  operations:  first,  the  pieces  of  the  bench 
on each side of the man can be grouped, using an interrupted line template.
Next,  a  textured,  fragmented  contour  along  the  back  of  his  coat  can  be  as¬ 
sembled  into  a  model  of  a  backlit,  wrinkled,  and  rounded  object.  And  his 
cap  comes  forward  because  it  occludes  the  background  and  his  face  and 
simultaneously  the  fact  that  the  black  triangle  over  his  eyes  is  a  shadow 
can  be  deduced.  All  of  these  deductions  involve  fitting  simple  models  of 
scene  fragments.  At  this  point,  there  is  finally  a  chance  for  high-level  mod¬ 
els  to  find  the  right  parts  of  the  scene  to  fit  and  we  already  know  enough 
about  the  lighting  to  know  what  would  be  in  shadow  and  what  would  be 
brightly  lit  (e.g.,  the  back  of  his  head). 

Besides  arguing  for  the  flow  chart  in  figure  7.1,  this  example  is  also 
useful  in  contrasting  Pattern  Theory  with  the  feedback  theory  of  Ullman 
(chapter  12).  Our  analysis  of  the  old  man  example  requires  multiple  in¬ 
dependent  and  concurrent  loops,  low-level  and  high-level,  some  modeling 
shading,  some  modeling  depth  planes,  some  modeling  clothed  bodies, 
and  some  modeling  faces.  This  suggests  that  Ullman's  theory  with  a  sin¬ 
gle  bottom-up  search  and  single  top-down  search  could  not  easily  solve 
the  old  man  puzzle.  Postulating  multiple  independent  feedback  loops, 
instead  of  one  global  feedback  from  stored  knowledge  to  the  sensorium, 
is  also  helpful  in  comparing  Pattern  Theory  with  Marr's  theory  of  vision 
(Marr  1982).  Marr  was  very  influenced  by  several  examples  in  which  top- 
down  information  was  either  not  needed  or  ignored  in  accomplishing  some 
feedforward computational task (e.g., fusing random-dot stereograms or
construction of 3D models from unorthodox 2D views by victims of agnosia).
This led him to propose a purely feedforward theory of vision. We
would argue that all his examples are evidence against strong feedback models,
like Ullman's, in which high-level knowledge is fed back all the way to
low-level stages, and that none of his examples contradicts the hypothesis
that multiple, more local, feedback loops are being used.

Figure 7.2 An image that illustrates some difficulties in recognition. (Top) The image.
(Bottom left) Canny's edge detector applied to the image. (Bottom right) A blow-up of the face
showing the lack of recognizable features.

Evidence  from  Neuroanatomy 

We  now  turn  to  the  cortex  itself  and  ask  whether  we  can  find  a  confirma¬ 
tion  in  its  structures  of  the  theory  that  bottom-up  pattern  analysis  cannot 
be  done  independently  of  top-down  pattern  synthesis.  Indeed,  one  of  the 
main  themes  in  neuroanatomy  in  the  last  several  decades  has  been  the 
discovery  that  the  cortex  is  naturally  divided  into  distinct  areas  that  are 
reciprocally  connected  by  pathways  created  by  the  axons  of  their  pyramidal 
neurons. Pattern Theory strongly suggests that these pairs of pathways
should  instantiate  the  dual  computational  processes  of  analysis  and  syn¬ 
thesis.  This  proposal  is  strongly  supported  by  the  still  emerging  picture  of 
the  cortical  layers  connected  by  these  pathways.  Some  of  these  pathways 
terminate  principally  in  layer  4,  the  standard  "input"  layer  for  bottom-up 
cortical  processing,  the  route  from  raw  sensory  input  to  higher  association 
areas:  it  is  natural  to  propose  that  these  pathways  carry  out  pattern  analy¬ 
sis.  Other  pathways  terminate  mostly  in  layers  1  and  6,  the  top  and  bottom 
of  the  cortical  plate,  and  are  typically  dual  to  the  first  set  (i.e.,  if  area  A  is 
connected  by  the  first  type  of  pathway  to  area  B,  then  one  of  the  second 
type  connects  area  B  back  to  area  A).  Pattern  Theory  suggests  that  these 
pathways  should  carry  out  pattern  synthesis. 

These  cortical  feedback  pathways  are,  perhaps,  the  most  complex  piece 
of  wiring  in  the  brain  and  it  is  astonishing  that  evolution  has  been  able  to 
create  them.  Does  their  evolution  support  our  proposal  that  all  cortico- 
cortical  pathways  should  belong  to  two  separate  systems,  a  bottom-up 
processing  pathway  and  a  top-down  processing  pathway?  The  homolo¬ 
gies  between  mammalian  neocortex  and  reptilian  telencephalic  structures 
are  not  obvious  and  there  has  been  much  debate  on  them.  One  set  of  ho¬ 
mologies  is  the  so-called  dual  origin  hypothesis,  which  goes  back  to  the 
pioneering  work  of  Marin-Padilla  (1978).  This  theory  has  been  developed 
by  Karten  and  most  recently  by  Deacon  (see  Karten  and  Shimizu  1989; 
Deacon  1990)  and  has  been  gaining  adherents.  It  proposes  that  the  six¬ 
layered  mammalian  neocortex  is  not  homologous  to  a  single  structure  in 
the  reptile,  but  that  two  structures,  separate  in  the  reptile,  have  become 
merged  in  the  mammal.  More  specifically,  (1)  the  top  and  bottom  layers  of 
the mammalian neocortex when originally formed in the embryo are homologous
to the two-layered dorsal cortical plate, or pallium, of the reptile,
and  (2)  that  the  population  of  neurons  that  migrates  during  mammalian 
embryogenesis  to  form  the  inner  layers  of  the  neocortex  is  homologous  to 
the  neurons  of  the  dorsal  ventricular  ridge  in  the  reptile. 

This  theory  is  shown,  in  simplified  form,  in  figure  7.3.  What  Deacon  has 
pointed  out  is  that  this  theory  explains  beautifully  the  existence  of  recip¬ 
rocal  pathways  and  their  most  common  laminar  patterns  (Deacon  1990, 
pp.  686-691,  especially  last  paragraph).  Note  that  in  the  reptile,  there  are 
no  directly  reciprocal  pathways,  all  loops  being  longer  and  more  indirect. 
But  the  original  pallium  carries  its  own  internal  connections,  labeled  "A" 
in  figure  7.3,  many  of  which  emanate  from  the  olfactory  and  limbic  cortex 
and  proceed  caudally.  Moreover,  the  dorsal  ventricular  ridge  (DVR)  has 
its  internal  pathways  labeled  "B,"  which  proceed  rostrally.  When  in  the 
mammalian  embryo  the  homologous  structure  to  the  latter  migrates  inside 
the  homologous  structure  to  the  former,  Deacon  proposes,  because  of  the
conservatism  of  evolution,  that  homologous  connections  will  still  be  estab¬ 
lished:  the  pathways  A,  descending  from  limbic  areas  and  synapsing  on 
layers  1  and  6,  the  residues  of  the  dorsal  cortical  plate,  are  still  laid  down 
and  become  the  top-down  pathways  of  the  mammal;  and  the  pathways  B, 

Figure  7.3  A  comparison  of  the  main  structures  in  the  reptilian  (left)  and  mammalian  (right)
brains,  illustrating  Marin-Padilla  and  Deacon's  theories  of  the  dual  origin  of  the  neocortex 
and  its  reciprocal  pathways  from  the  pallium  and  dorsal  ventricular  ridge. 


ascending  from  sensory  areas  in  the  DVR,  synapsing  in  the  middle  layers, 
become  the  bottom-up  pathways  of  the  mammal.  Moreover,  the  thalamocortical
reciprocal  pathways  arise  in  a  similar  way,  from  the  thalamus  →  DVR
pathway  B',  and  the  pallium  →  thalamus  pathway  A'.  (We  have  simplified
the  picture  somewhat  by  excluding  the  geniculocortical  pathway
and  its  precursor.)

One  can  make  a  suggestive  link  of  Pattern  Theory  with  the  40-  to  60-Hz 
cortical  oscillations  that  have  been  observed  in  the  last  decade  in  so  many 
structures  in  so  many  distinct  recording  modes  (cf.  Singer,  chapter  10). 
The  link  is  the  proposal  that  this  oscillation  is  a  reflection  of  the  basic  cycle 
of  computation  in  which  bottom-up  features  are  compared  with  top-down 
memories  and  expectations,  of  the  iterative  operation  of  the  loop  in  figure 
7.1  (bottom).  The  strongest  evidence  for  this  is  the  observation  that  these 
oscillations  lock  when  the  cells  are  responding  to  linked  parts  of  the  stimulus,
both  in  different  parts  of  V1  and  between  V1  and  V2.  It  is  important  to
realize  that  if  successive  cycles  of  this  oscillation  represent  successive  iterations
in  a  computation,  one  would  not  expect  exactly  the  same  cells  to  participate
in  each  cycle.  Therefore,  the  oscillation  would  be  much  stronger  in
field  potentials  than  in  single  cell  recordings.  This  is  exactly  what  is  found. 
For  instance,  field  potential  oscillations  were  discovered  by  Freeman  in  the 
1970s  (Freeman  1975)  in  the  olfactory  bulb  and  cortex.  It  is  interesting  to 
note  that  one  form  they  take  here  is  repeated  sweeps  of  rostral-to-caudal 
excitation,  as  though  the  two  poles  of  the  olfactory  bulb  are  like  two  neo- 
cortical  areas  communicating  and  oscillating  via  long  axons  (compare  the 
model  of  Wilson  and  Bower  1992).  The  oscillation  even  shows  up  on  the  entire
cortex  in  human  MEG  recordings:  see  Llinas  et  al.  (1991),  which  shows  a 
40-Hz  oscillation  sweeping  over  the  whole  cortex  and  Ribary  et  al.  (1991), 
which  shows  the  oscillation  between  cortical  and  thalamic  activity. 
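
The argument that the oscillation should be stronger in field potentials than in single-cell recordings can be illustrated with a toy simulation. This is only a sketch under assumptions that are ours, not the chapter's: each cell joins a given cycle independently with a fixed probability, and the "field potential" is simply the number of cells firing on that cycle. The function name and all parameter values are hypothetical.

```python
import random

def simulate_cycles(n_cells=100, n_cycles=200, p_participate=0.1, seed=0):
    """Each cycle, every cell independently joins the volley with probability p."""
    rng = random.Random(seed)
    # spikes[c][k] is True if cell c fires on cycle k.
    spikes = [[rng.random() < p_participate for _ in range(n_cycles)]
              for _ in range(n_cells)]
    # Crude stand-in for a field potential: number of cells firing per cycle.
    field = [sum(cell[k] for cell in spikes) for k in range(n_cycles)]
    return spikes, field

spikes, field = simulate_cycles()
# Fraction of cycles that leave a trace in the summed signal.
field_rate = sum(1 for f in field if f > 0) / len(field)
# Fraction of cycles in which one particular cell takes part.
cell_rate = sum(spikes[0]) / len(spikes[0])
print(f"field shows {field_rate:.0%} of cycles, cell 0 shows {cell_rate:.0%}")
```

Under these assumptions nearly every cycle is visible in the summed signal, while any one cell takes part in only a small fraction of cycles, which is the qualitative pattern described above.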

If  we  make  a  crude  connectivity  model  of  the  type  of  circuit  that  emerges
from  this  analysis,  what  does  it  look  like?  Is  it  like  the  "black  box"  computational
models  that  have  long  been  the  staples  of  the  computer  metaphor
(figure  7.4,  top)?  These  diagrams  stem  from  50  years  of  development  of  the 
computer,  starting  from  Von  Neumann.  The  major  computational  steps  are 
carefully  dissected  and  put  in  separate  boxes,  necessary  data  flow  paths 
are  added,  and  the  whole  thing  operates  like  a  chemical  factory.  This 
point  of  view  is  highly  developed  in  the  books  of  Fodor  and  Kosslyn;  its 
computational  foundation  has  been  beautifully  expounded  in  the  book  by 
Abelson  and  Sussman  (1985).  But  this  is  not  what's  there!  In  the  cortex, 
roughly  65%  of  all  cells  are  pyramidal  cells  that  send  their  output  to  dis¬ 
tant  cortical  areas,  as  well  as  locally  via  their  axon  collaterals.  This  means 
that  there  is  no  hiding  of  local  information,  no  "local  variables"  or  protected 
data.  A  better  picture  is  figure  7.4,  bottom.  Instead  of  black  boxes  with 
opaque  walls,  we  have  apartments  in  a  cheap  housing  complex  with  very 
thin  walls!  All  your  neighbors  hear  everything  that  is  going  on  in  your 
home.  Instead  of  "hiding  local  variables,"  a  device  central  to  all  modular 
programming,  every  little  whimsy  that  occurs  to  you  goes  out  instantly  to 
all  and  sundry.

It  seems  to  me  that  the  computational  metaphor  itself  is  flawed.  Pattern 
Theory  has  a  clear  explanation:  these  tightly  coupled  cortical  areas  are  ex¬ 
actly  the  higher  and  lower  level  areas  of  pattern  theory  that  seek,  by  a  sort 
of  relaxation  algorithm,  to  come  to  a  mutual  understanding  in  which  the 
lower  area's  more  concrete  data  are  fit  with  a  known,  more  abstract,  cate¬ 
gory  expressed  by  the  higher  area's  activity.  This  is  a  fundamental  shift  in 
focus  from  the  computational  metaphor.  Just  as,  for  instance,  Edelman  has 
proposed  Darwinian,  evolutionary  metaphors  as  the  right  ones  for  mod¬ 
eling  brain  function  (cf.  Edelman  1987),  similarly  pattern  theory  implies 
a  new  paradigm:  that  of  many  different  parts  of  the  brain  attempting  to 
reconcile  their  states,  their  implicit  descriptions  of  part  of  reality,  with  the 
states  of  other  areas,  either  through  bottom-up  assertions  of  facts  that  have 
to  be  dealt  with  or  top-down  memories  of  expected  patterns.  This  is  re¬ 
lated  to  Minsky's  idea  of  the  brain  consisting  of  many  agents,  in  "Society 
of  Mind"  (Minsky  1985). 

Does  all  this  speculation  mean  anything  for  the  experimenter?  Does  it 
have  any  predictive  force?  To  begin  with,  it  implies  that  there  will  be  more 
correlation  between  single-cell  responses  in  different  areas  than  would  be 
expected  if  the  areas  were  black  boxes,  hiding  their  characteristic  internal 
computations  from  each  other.  For  instance,  we  see  this  in  the  tremendous 
overlap  of  the  characteristics  of  single  cells  in  the  various  visual  areas, 
which  has  prevented  assigning  any  clear  functional  role  to  V3  or  V4  (aside 
from  generalities  like  being  concerned  with  shape  or  color).  What  we  think 
is  the  most  important  implication,  however,  depends  on  a  refinement  of 
multiple  cell  recording  techniques:  Pollen  has  proposed  a  technique  for 
preparing  an  animal  with  electrodes  recording  from  cells  in  two  areas  that 
show  significant  cross-correlation  in  their  spiking  (cf.  preliminary  work 

Figure  7.4  (Top)  The  modular  approach  to  cognition  and  computation,  in  which  individual
steps  are  carried  out  "privately"  and  only  final  results  are  broadcast.  (Bottom)  The  relaxation 
approach  of  the  cortex,  in  which  two-thirds  of  all  neurons  send  their  output  both  locally  and 
to  distant  areas. 


by  Liu  et  al.  1992).  At  this  point,  instead  of  looking  at  the  responses  of  the 
two  cells  in  isolation,  one  can  separate  for  analysis  the  correlated  spikes  from  the 
full  spike  trains.  The  theory  suggests  that  this  set  of  spikes  may  be  much 
less  stochastic,  carrying  the  information  transmitted  between  areas,  and 
hopefully  correlated  much  more  precisely  and  predictably  with  identifiable 
aspects  of  the  stimulus.  To  be  more  specific,  we  must  turn  to  what  the
theory  conjectures  about  the  content  and  nature  of  the  representation  in 
individual  areas  and,  using  this,  its  description  of  the  data  transmitted 
back  and  forth  between  areas. 
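
As a rough illustration of what "separating the correlated spikes from the full spike trains" could mean in practice, the sketch below keeps only those spikes of one cell that have a partner spike in the other cell within a small time window. The function name, the choice of window, and the toy spike times (in milliseconds) are our own assumptions; the experimental proposal cited above may define correlated spikes differently.

```python
def correlated_spikes(train_a, train_b, window=2):
    """Return the spikes of train_a that have a partner in train_b
    no more than `window` time units away (same units as the spike times)."""
    b = sorted(train_b)
    matched = []
    j = 0
    for t in sorted(train_a):
        # Skip spikes in b that are too early to match t or any later spike.
        while j < len(b) and b[j] < t - window:
            j += 1
        # b[j] is now the earliest candidate partner for t.
        if j < len(b) and abs(b[j] - t) <= window:
            matched.append(t)
    return matched

# Hypothetical spike times in milliseconds for two cells in different areas.
cell_a = [10, 25, 40, 61, 90]
cell_b = [11, 30, 59, 60, 120]
print(correlated_spikes(cell_a, cell_b))
```

The matched subset, rather than either full train, is what one would then compare against features of the stimulus.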

THE  FOUR  BASIC  DEFORMATIONS


What  They  Are 

The  second  basic  idea  of  Pattern  Theory  is  that  the  processes  that  trans¬ 
form  the  world  variables  into  the  observed  variables  are  not  arbitrarily 
complicated,  varying  widely  from  one  channel  to  another.  Instead,  four 
basic  transformations,  or  deformations  as  Grenander  called  them,  can  be 
found  at  work  in  every  channel.  These  are  the  following: 

1.  Noise  and  blur.  These  effects  are  the  basis  of  standard  signal  processing, 
caused,  for  instance,  by  sampling  error,  background  noise,  and  imperfec¬ 
tions  in  your  measuring  instrument  such  as  imperfect  lenses,  veins  in  front 
of  the  retina,  dust,  and  rust.  Typically,  the  full  real  world  signal  is  mea¬ 
sured  only  at  discrete  sample  points;  its  value  at  each  point  gets  averaged 
with  its  neighbors — this  is  blur — and  corrupted  by  the  addition  of  some 
unknown  noisy  factors.  In  more  cognitive  applications,  like  the  medical 
expert  systems,  errors  in  tests,  the  inadequacy  of  language  in  conveying  the 
nature  of  some  pain  or  symptom,  confusing  extraneous  factors,  all  belong 
to  this  class. 

2.  Multiscale  superposition.  Signals  typically  reveal  one  set  of  structures 
caused  by  one  set  of  phenomena  in  the  world  when  analyzed  locally,  at 
high  precision,  and  other  structures  and  phenomena  when  analyzed  glob¬ 
ally  and  coarsely,  at  low  precision.  For  instance,  in  images,  local  proper¬ 
ties  include  sharp  edges,  texture  details,  and  local  irregularities  of  shapes, 
which  coexist  with  global  properties  like  slowly  varying  shading  or  texture 
statistic  gradients  and  the  overall  shape  of  an  object.  In  speech,  information 
is  conveyed  by  the  highest  frequency  formants,  by  the  lower  frequency  vi¬ 
bration  of  the  vocal  cords  and  the  even  slower  modulation  of  stress.  These 
spatial  or  temporal  frequency  bands  may  be  combined  additively  (as  in 
Fourier  analysis  or  wavelets),  multiplicatively  (as  in  AM  coding)  or  by 
more  complex  nonlinear  rules.  In  higher  order  processing,  the  analog  of 
this  decomposition  into  the  "overall"  shape  versus  fine  local  detail  is  the 
hierarchical  model  of  concepts  embodied  in  semantic  nets.  These  mod¬ 
els  describe  a  situation  partly  by  its  general  properties,  the  very  inclusive 
superordinate  categories  (in  the  terminology  of  Rosch  1978)  to  which  it  be¬ 
longs,  and  partly  by  its  details,  the  subordinate  categories  of  Rosch.  Thus 
a  patient  is  in  simplest  terms  "very  ill";  in  more  precise  terms  the  patient 
has  pneumonia,  is  contagious,  should  be  hospitalized,  and  in  very  precise 
terms  is  infected  by  such  and  such  a  bacterium,  has  a  temperature  of  103,  etc.

3.  Domain  warping.  Two  signals  generated  by  the  same  object  or  event  in 
different  contexts  typically  differ  because  of  expansions  or  contractions  of 
their  domains,  possibly  at  varying  rates:  phonemes  may  be  pronounced 
faster  or  slower,  the  image  of  a  face  is  stretched  or  shrunk  by  varying 
expression  and  viewing  angle.  In  speech,  this  is  called  "time  warping"  and 
in  vision  this  is  modeled  by  "flexible  templates."  In  both  cases,  there  is  a 
mapping  from  the  domain  of  one  signal  to  the  domain  of  the  other,  either  a
map  of  time  intervals  or  a  map  between  two-dimensional  domains,  which 
carries  the  salient  parts  of  one  signal  to  the  corresponding  parts  of  the  other. 
The  cognitive  version  of  this  type  of  distortion  is  thinking  with  analogies. 
In  an  analogy,  some  or  all  the  elements  in  two  situations  can  be  mapped 
to  each  other,  preserving  many  of  their  interrelations,  just  as  the  same 
elements  occur  in  two  faces,  with  nearly  the  same  spatial  relationships.  In 
all  cases,  the  map  may  be  incomplete,  in  that  some  parts  of  one  situation 
may  not  have  corresponding  parts  in  the  other.  Thus  one  face  may  be 
partially  obscured  by  hair  or  a  bandage,  and  only  the  unoccluded  parts 
match  up. 

4.  Interruptions.  Natural  signals  are  usually  analyzed  best  after  being  bro¬ 
ken  up  into  pieces  consisting  of  their  restrictions  to  subdomains.  This  is 
because  the  world  itself  is  made  up  of  many  objects  and  events  and  different 
parts  of  the  signal  are  caused  by  different  objects  or  events.  For  instance, 
an  image  typically  shows  different  objects  partially  occluding  each  other 
at  their  edges.  In  speech,  the  phonemes  naturally  break  up  the  signal  and, 
on  a  larger  scale,  one  speaker  or  unexpected  sound  may  interrupt  another. 
Obviously,  in  the  cognitive  realm  too,  several  processes  may  be  at  work  at 
once,  as  in  a  patient  who  has  several  medical  problems  at  once.  To  infer 
the  correct  values  of  the  hidden  variables,  the  effects  of  the  different  pro¬ 
cesses  must  be  separated  from  each  other.  A  general  term  for  isolating  the 
effects  of  one  process,  object,  or  event  from  all  the  myriad  others  going  on 
simultaneously  is  figure/ground  separation. 
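
The four deformations can be made concrete on a one-dimensional toy signal. The sketch below is only illustrative: the function names, the box blur, the additive form of the superposition, and the integer index map used for warping are all our own simplifying assumptions, chosen to keep each deformation to a few lines.

```python
import random

def blur(signal, radius=1):
    """Noise and blur, part 1: average each sample with its neighbours."""
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out

def add_noise(signal, sigma, rng):
    """Noise and blur, part 2: corrupt each sample with Gaussian noise."""
    return [s + rng.gauss(0.0, sigma) for s in signal]

def superpose(coarse, fine):
    """Multiscale superposition (additive case): slow plus fast component."""
    return [c + f for c, f in zip(coarse, fine)]

def warp(signal, mapping):
    """Domain warping: output sample i is read from input position mapping[i]."""
    return [signal[j] for j in mapping]

def interrupt(background, foreground, start):
    """Interruption: the foreground occludes part of the background."""
    out = list(background)
    end = min(len(out), start + len(foreground))
    out[start:end] = foreground[:end - start]
    return out

rng = random.Random(0)
coarse = [i / 10 for i in range(20)]            # slowly varying ramp
fine = [0.2 * (-1) ** i for i in range(20)]     # fast alternation
world = superpose(coarse, fine)
stretched = warp(world, [i // 2 for i in range(20)])   # played at half speed
occluded = interrupt(stretched, [9.0] * 5, start=8)    # a second object intrudes
observed = add_noise(blur(occluded), sigma=0.05, rng=rng)
```

Decoding, in the sense of the text, would mean recovering `coarse`, `fine`, the warp, and the occluder from `observed` alone.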

This  part  of  Pattern  Theory  has  a  great  deal  to  say  to  neuronal  models.  If 
these  four  transformations  are  universal  coding  mechanisms,  which  must 
be  decoded  by  a  brain,  there  should  be  mechanisms  for  all  of  them  if  you 
look  in  the  right  way.  If  they  are  truly  universal,  these  mechanisms  should 
be  general  circuits  that  occur  in  all  areas  of  cortex.  This  is  the  challenge 
of  Pattern  Theory.  We  will  discuss  in  separate  sections  below  possible 
neuronal  correlates  of  deformations  1,  3,  and  4. 

Deformation  2,  multiscale  superposition,  has  often  been  discussed  for 
vision  as  the  "pyramid"  data  structure  and  associated  algorithms  often 
using  a  moving  window  of  attention.  It  was  only  at  this  meeting,  however, 
that  we  heard  Van  Essen  and  Anderson  propose  how  such  a  pyramid  could 
be  laid  out  cortically  using  the  three  areas  V1,  V2,  and  V4  (see  chapter  13).
We  will  not  discuss  the  decoding  of  this  deformation  except  to  mention 
that  one  of  the  major  computations  using  a  pyramid  is  the  discovery  of 
the  "part-of"  relations  between  blobs  of  different  sizes  (for  instance,  as  a 
step  to  recognition  of  complex  objects,  e.g.,  Hong  and  Rosenfeld  1984). 
Striking  evidence  that  this  is  done  by  the  recognition  of  small  and  large 
blobs  in  parallel,  with  hard-wired  "part-of"  connections,  was  recently  found
by  Jeremy  Wolfe  (unpublished),  who  found  that  (1)  red  houses  with  yellow
windows  pop-out  in  a  field  of  differently  colored  houses  and  windows, 
while  (2)  duplex  half  red  and  half  yellow  houses  do  not  pop-out!  I  believe 
this  strongly  supports  Anderson  and  Van  Essen's  theory,  because  it  can  be
explained  by  the  concurrent  recognition  of  the  red  houses  in  V2  and  yellow
windows  in  V1,  with  reciprocal  V1,  V2  pathways  marking  "part-of"  rapidly
strengthening  the  activation  to  threshold.

Nonlinear  Filtering  and  the  Thalamocortical  Loop 

Let  us  look  at  the  lowest  level  loops  in  the  circuitry  of  the  cortex  and 
its  immediate  neighbors.  The  most  basic  of  these  are  the  loops  connect¬ 
ing  various  cortical  areas  with  various  nuclei  in  the  thalamus,  especially 
the  loop  between  visual  area  V1  and  the  LGN.  In  many  cases,  these  give
primary  sensory  input  to  the  cortex  and  a  natural  idea,  in  the  context  of 
pattern  theory,  is  that  these  would  be  concerned  with  correcting  for  the 
most  basic  "deformation"  of  the  sensory  signal — noise  and  blur.  For  in¬ 
stance,  Grossberg  has  often  pointed  out  that  the  visual  signal  coming  from 
the  retina  must  be  distorted  by  the  presence  of  veins  on  the  inner  surface 
of  the  retina,  not  to  mention  the  blind  spot  itself.  Ever  since  Yarbus  (1967), 
it  has  been  known  that  within  each  fixation,  the  eye  is  far  from  still,  but 
drifts  irregularly,  with  a  constant  tremor  of  several  minutes  of  arc  (enough 
to  move  sharp  edges  across  several  adjacent  cones  in  the  fovea).  In  addi¬ 
tion,  the  light  signal,  as  it  strikes  the  eye,  is  already  the  result  of  conflicting 
processes  that  obscure  its  origin:  the  "accidental"  markings  on  textured 
surfaces  obscure  their  shape,  and  lighting  effects  are  complicated  by  local 
self-shadowing  and  mutual  reflections.  Although  part  of  the  rich  com¬ 
plexity  of  the  world,  they  act  like  noise  and  blur  if  you  are  attempting  to 
reconstruct  the  outlines  of  the  major  objects  in  view. 

For  many  years,  engineers  have  proposed  appropriate  filtering  as  the 
universal  solution  to  the  problem  of  compensating  for  noise  and  blur.  But 
pattern  theory  would  propose  that,  like  the  other  types  of  deformations, 
they  must  be  corrected  for,  not  by  a  blind  bottom-up  filter,  but  by  an  adaptive
feedback  process.  This  is  a  logical  role  to  propose  for  the  thalamocortical
loop.  Specifically,  the  reciprocal  LGN  ↔  V1  pathways  should
implement  an  image  processing  algorithm,  which  "cleans  up"  and  disambiguates
the  visual  signal.  Typical  functions  of  image  processing  are  noise
removal  and  edge  enhancement.  No  wonder  single  cell  recordings  could
never  find  any  role  for  the  V1  →  LGN  feedback:  the  squeaky  clean  laboratory
signals,  with  edges,  bars,  and  sine  wave  gratings  do  not  need  any
image  processing!  Experimental  tests  for  this  hypothesis  are  easy  to  draft, 
once  one  is  committed  to  presenting  more  complex  and  realistic  stimuli, 
for  which  the  response  cannot  be  summarized  by  linear  approximations, 
like  the  impulse  transfer  function.  Several  such  proposals  are  presented  in 
Mumford  (1991, 1992). 

How  are  these  image  processing  tasks  accomplished?  We  assume  that 
the  complex  cells,  whose  response,  to  a  first  approximation,  is  like  a  power 
Gabor  filter  with  a  preferred  scale  and  orientation,  attempt  to  find  the 
salient  edges  and  bars  in  an  image.  But  typically,  many  of  these  will  be 
responding  simultaneously  in  each  local  region  and  one  must  find  how  to
reconcile  them  (e.g.,  one  such  cell  "sees"  a  strong  long  line,  the  other  an 
edgelet  that  is  part  of  texture;  or  one  marks  the  end  of  a  bar,  the  other  its 
sides).  Before  a  consistent  interpretation  is  found  for  each  part  of  an  image, 
many  conflicting  local  organizations  may  be  detected  and  there  is  a  need 
for  some  kind  of  decision  mechanism  such  as  a  "winner-take-all"  circuit. 

There  are  several  hints  of  such  decision  mechanisms  in  the  corticothalamic
projection.  Several  groups  (cf.  McGuire  et  al.  1984  in  cat;  White
and  Keller  1987  in  mouse)  have  reported  that  the  axon  collaterals  of  the
layer  6  V1  pyramidal  cells  and  especially  the  corticothalamic  projection
cells  appear  to  synapse  largely  on  aspinous  interneurons,  presumably  inhibitory
cells.  This  has  the  look  of  a  winner-take-all  network,  an  organization
long  predicted  in  the  neural  net  literature,  but  never  clearly  identified
in  the  cortex  to  our  knowledge.  Alternatively,  the  inhibitory  cells  in  the  LGN
could  provide  a  voting  mechanism.  In  other  words,  if  these  were  absent, 
the  various  feedback  signals  from  cortex  would  simply  be  averaged  in 
the  dendritic  trees  of  LGN  "relay"  cells.  But  if  some  of  them  synapse  on 
inhibitory  cells,  they  can  effectively  suppress  other  feedback  and  feedfor¬ 
ward  signals. 
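
The contrast between plain averaging of feedback signals and suppression through inhibitory cells can be sketched in a few lines. This is a deliberately crude toy, not a model of cortical or thalamic circuitry: the subtractive mutual inhibition, the inhibition strength, and the function names are our own assumptions, and exactly tied inputs may suppress one another entirely.

```python
def average(signals):
    """Without inhibition: competing signals are simply averaged."""
    return sum(signals) / len(signals)

def winner_take_all(signals, inhibition=0.5, steps=50):
    """With mutual inhibition: each unit is suppressed in proportion to the
    summed activity of the others, and activity is clipped at zero."""
    acts = list(signals)
    for _ in range(steps):
        total = sum(acts)
        acts = [max(0.0, a - inhibition * (total - a)) for a in acts]
    return acts

# Three hypothetical conflicting local interpretations of one image region.
evidence = [0.6, 1.0, 0.7]
print(average(evidence))          # a blend of all three
print(winner_take_all(evidence))  # only the strongest unit stays active
```

Because the update is monotone in each unit's own activity, the ordering of the units is preserved until the weaker ones are clipped to zero, so the unit with the strongest initial evidence is the one that survives.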

Shifter  Circuits,  Flexible  Templates,  and  Population  Coding 

A  more  radical  part  of  the  pattern  theory  analysis  is  the  proposal  that  do¬ 
main  warping  is  a  universal  deformation.  This  means  that  in  analyzing 
signals,  and  matching  signals  against  patterns  in  memory,  the  pattern  of  activity 
on  the  cortex  must  be  displaced  (in  the  two-dimensional  coordinates  of  the  cortical 
surface).  Such  operations  have  been  proposed  under  the  name  of  "shifter 
circuits,"  most  recently  in  Anderson  and  Van  Essen  (1987).  Although  ar¬ 
gued  for  by  theorists  for  some  time,  only  recently  has  evidence  appeared 
for  their  existence  in  cortex.  In  a  beautiful  paper  on  recordings  in  the  pari¬ 
etal  lobe,  Duhamel  et  al.  (1992)  found  that  activity  correlated  to  the  visual 
location  of  different  objects  in  front  of  an  awake  monkey  is  shifted  on  the 
parietal  lobe  surface  in  anticipation  of  a  saccade  that  will  shift  the  visual 
sensory  signal.  In  a  totally  different  part  of  the  cortex,  Georgopoulos  and 
his  group  have  found  that  activity  in  the  primary  motor  area  M1  is  shifted
as  the  precise  coordinates  of  an  intended  arm  movement  are  computed.
Note  that  this  example  is  not  sensory  but  motor-planning  related:  here  the
activity  pattern  for  one  standard  arm  movement  is  first  recreated  in  M1,
and  then  it  is  modified  over  a  100-msec  period  by  domain  warping  until  it 
is  appropriate  for  the  specific  movement  presently  desired. 

The  simplest  example  where  there  is  a  need  for  this  mechanism  is  in 
the  computations  of  stereo  vision,  in  the  correlation  of  the  left  and  right 
eye  movement.  This  example  was  used  by  Anderson  and  Van  Essen  and 
by  Poggio.  As  they  point  out,  what  makes  it  especially  compelling  is  the 
existence  of  tremors  in  eye  position  of  up  to  10  min  of  arc  during  a  period 
of  fixation:  without  active  compensation  for  this,  stabilizing  the  image,
it  is  very  hard  to  imagine  how  stereo  cells  in  V1  can  respond  robustly  to
left/right  eye  feature  disparities  of  only  several  minutes,  let  alone  account
for  the  psychophysical  evidence  of  disparity  hyperacuity  of  less  than  a
minute  of  arc.  Anderson  and  Van  Essen  propose  that,  in  the  primate,  this  is
carried  out  by  shifter  circuits  in  V1,  which  has  developed  a  highly  specialized
layer  4,  making  it  unique  for  its  cell  density  among  mammalian  cortical
areas.  In  the  less  specialized  case  of  the  cat,  we  would  propose  that  this
stabilization  results  from  the  action  of  the  LGN  ↔  V1  loop,  rather  than  a
hard-wired  shifter  circuit  in  V1  (and  that  this  circuit  is  representative  of  the
general  mechanism  used  to  implement  domain  warping).

What  neural  circuitry  could  accomplish  this?  In  figure  7.5,  we  make  a 
simple  proposal.  We  suggest  that  (in  the  cat)  each  retinal  ganglion  cell's
axons  synapse  on  multiple  LGN  "relay"  cells  and  that  populations  of  such 
cells  synapse  in  overlapping  ways.  Thus  one  LGN  cell  receives  input  from 
multiple  retinal  cells,  but  on  distinct  branches  of  its  dendritic  tree.  Nor¬ 
mally  one  of  these  is  the  strongest  and  that  retinal  cell  takes  charge  of  that 
particular  LGN  cell.  But  under  cortical  influence,  both  excitatory  and  in¬ 
hibitory,  some  of  these  synapses  can  be  strengthened  and  some  weakened 
by  local  postsynaptic  potentials  on  the  different  branches  of  its  dendritic 
arbor.  This  could  be  done  by  a  variety  of  mechanisms,  including  NMDA 
channels.  In  the  figure,  we  have  drawn  one  possibility  using  inhibitory 
effects,  caused  by  the  dendrodendritic  triadic  synapses  with  inhibitory 
glomerular  interneurons.  Following  Sherman  and  Koch  (1990,  pp.  256-266),
we  have  assumed  that  this  interaction  takes  place  on  spines  of  the
"relay"  cell,  where  the  retinal  and  glomerular  inputs  are  combined  in  a 
synaptic  triad  functioning  like  an  “x  and  NOT  y”  gate.  The  effect  is  that 
each  LGN  cell  is  driven  by  a  different  retinal  cell  and  the  pattern  of  activa¬ 
tion  is  shifted  in  the  LGN.  Note  that  such  shifts  must  be  vertical  as  well  as 
horizontal,  as  evidence  (cf.  Motter  and  Poggio  1984)  shows  that  the  two 
eyes  are  usually  misaligned  vertically  by  5  to  10  min  of  arc.  This  shifting 
can  accomplish  two  things  at  once:  it  can  compensate  for  tremor  and  mis¬ 
alignment  and  it  can  create  a  simulated  vergence  movement  to  align  more 
closely  the  left  and  right  eye  images,  thus  reducing  the  disparity  of  the  signal
received  by  V1  so  that  the  exquisitely  sensitive  "tuned  excitatory  cells"
of  V1  can  measure  extremely  fine  residual  disparities.  One  prediction  that
this  makes  is  that  the  left  and  right  eye  layers  of  the  LGN  should  interact 
through  cortical  feedback.  Varela  and  Singer  (1987)  show  that  this  does  hap¬ 
pen,  and,  even  more  interesting,  if  the  left  and  right  eyes  are  stimulated 
with  radically  conflicting  signals,  which  cannot  be  put  in  binocular  regis¬ 
tration,  then  the  LGN  "relay"  signals  decrease  markedly  after  about  1  sec. 
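
The shifting proposal can be caricatured in one dimension with binary units. In this sketch, which rests on our own simplifying assumptions, each LGN relay cell has three dendritic branches that receive input from the retinal cell directly "below" it and from its two neighbours; cortical feedback is represented only by the set of branch offsets whose interneurons it activates, and each branch computes "x and NOT y". The function name and the offset encoding are hypothetical.

```python
def lgn_pattern(retina, inhibited_offsets, offsets=(-1, 0, 1)):
    """Return binary LGN activity given binary retinal activity.

    Relay cell i listens to retinal cells i + d for each d in `offsets`,
    one per dendritic branch. A branch passes its input only if its
    interneuron is silent: x and NOT y.
    """
    n = len(retina)
    out = []
    for i in range(n):
        active = False
        for d in offsets:
            j = i + d
            if 0 <= j < n:
                x = bool(retina[j])          # retinal input on this branch
                y = d in inhibited_offsets   # interneuron driven by feedback
                active = active or (x and not y)
        out.append(int(active))
    return out

retina = [0, 0, 1, 1, 0, 0]
unshifted = lgn_pattern(retina, inhibited_offsets={-1, 1})      # centre branch open
shifted_right = lgn_pattern(retina, inhibited_offsets={0, 1})   # reads cell i - 1
shifted_left = lgn_pattern(retina, inhibited_offsets={-1, 0})   # reads cell i + 1
```

Changing which branches are inhibited moves the whole activity pattern by one cell in either direction without any change in the retinal input, which is the essential property of a shifter circuit.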

At  all  levels  of  the  cortex,  there  is  a  need  to  shift  patterns  of  activation 
in  order  to  find  matches  between  memories  and  expectations  and  the  par¬ 
ticularities  of  the  present  situation:  a  very  concrete  example  is  the  need  to 
recognize  a  familiar  face  with  any  of  the  millions  of  possible  combinations 
of  viewpoint,  lighting,  and  expression  that  can  occur.  Shifter  circuits  can 
accomplish  this  and  we  propose  that  this  shifting  is  accomplished  in  general  by 

Figure  7.5  A  possible  implementation  of  shifter  circuits  in  the  LGN:  V1  feedback  excites  inhibitory
glomerular  interneurons  that  combine  with  retinal  input  in  "x  and  NOT  y"  trisynaptic
connections  on  the  LGN  relay  cells.  (Diagram  labels:  to  V1,  from  V1,  from  retina.)

the  extensive  arborization  of  the  feedback  pathways,  selectively  exciting  and  inhibiting
the  collateral  spread  of  activity  in  a  given  cortical  area.  This  is  the  natural
generalization  of  the  circuits  in  figure  7.5.  Rockland's  beautiful  tracings
of  recurrent  axons  have  shown  how  amazingly  diverse  and
extensive  their  arborizations  can  be  (cf.  Rockland  and  Virga  1989, 1993). 

From  an  evolutionary  perspective,  we  can  contrast  this  with  what  hap¬ 
pens  in  the  reptile.  The  reptile  has  a  more  or  less  rigid  body  and  its  tectum 
contains  maps  of  its  visual,  auditory,  and  somatosensory  systems  in,  more 
or  less,  hard-wired  registration.  In  such  a  structure,  the  sensory  systems 
are  forced  to  combine  their  data  with  very  little  flexibility.  In  contrast, 
mammalian  cortex  has  a  unique  flexibility  due  to  the  separation  and  du¬ 
plication  of  cortical  mappings.  It  should  be  noted  that  the  existence  of 
multiple  sensory  maps  is  not  particular  to  higher  mammals,  but  is  uni¬ 
versal  in  mammalian  neocortex,  even  in  the  evolutionary  side  branches  of 
marsupials  and  edentates  (e.g.,  essentially  all  mammals  have  a  homolog 
of  both  V1  and  V2  [Kaas  1989]).  To  some  extent,  this  may  be  a  response
to  the  increased  flexibility  of  the  trunk,  especially  the  neck,  and  the  de¬ 
velopment  of  limbs,  which  require  that  the  animal  have  the  capability  of 
combining  visual,  auditory,  and  tactile  sensory  data  in  flexible  ways.  But 
it  also  affords  new  computational  capabilities:  in  particular,  the  sensory 
maps  in  distinct  areas  can  be  dynamically  interleaved,  creating  the  domain 
warping  needed  for  much  more  sophisticated  pattern  matching. 

An  objection  to  these  ideas  is  that  only  in  primary  and  (to  a  lesser  extent) 
secondary  sensory  and  motor  areas  can  one  find  a  coherent  meaning  to  the 
two-dimensional  cortical  layout  of  activation.  In  higher  cognitive  areas,
it  is  very  hard  to  imagine  why  abstract  thoughts  should  have  any  two- 
dimensional  structure  or  why  shifting  patterns  of  activation  on  the  cortical 
surface  would  be  useful!  We  think  the  answer  to  this  paradox  lies  in 
another  biological  principle,  which  is  strongly  at  odds  with  traditional 
cognitive  modeling.  This  is  population  coding:  many  experiments  reveal 
that  the  brain  does  not  store  facts  cleanly  and  discretely,  with  one  neuron 
firing  for  one  possibility,  another  for  a  second,  etc.  Instead,  there  is  a 
graded  pattern  of  firing  for  each  possibility,  but  with  shifting  strengths 
(possibly  with  coherent  pulse  timing  too)  for  each  situation.  It  seems  to 
me  that  this  is  directly  connected  to  the  linguistic  fact  that  the  meaning 
of  individual  words  in  human  languages  is  not  simple  and  clean  either: 
words  always  cover  a  great  variety  of  different  related  situations.  This 
is  exactly  what  you  would  expect  if  language  reflects  the  way  neurons 
fire,  and  if  higher  level  concepts  are  population  coded  like  sensory  and 
motor  signals.  But  a  corollary  of  population  coding  is  that  the  set  of  higher 
level  concepts  will  automatically  have  geometric  structure.  This  is  because  two 
concepts  can,  at  one  extreme,  excite  nearly  identical  populations  with  a 
small  change  in  the  degree  of  excitation  and  the  marginally  excited  neurons; 
and,  at  the  other  extreme,  can  excite  totally  disjoint  populations.  We  may 
thus  define  the  distance  between  two  concepts  by  the  degree  of  overlap  of 
their  representations,  or  the  correlation  of  the  vectors  of  neuron-by-neuron 
excitations  that  each  concept  arouses. 
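
A distance of this kind is easy to write down. The sketch below takes "one minus the Pearson correlation" of two excitation vectors as the distance, which is only one possible reading of the definition above; the six-neuron vectors and the concept labels are invented for illustration, and the function assumes neither vector is constant.

```python
import math

def correlation(u, v):
    """Pearson correlation of two equal-length excitation vectors.
    Assumes neither vector is constant (otherwise the denominator is zero)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def concept_distance(u, v):
    """Near 0 for almost identical populations, larger as overlap shrinks."""
    return 1.0 - correlation(u, v)

# Hypothetical firing rates of six neurons for three concepts.
dog = [0.9, 0.8, 0.1, 0.0, 0.0, 0.0]
wolf = [0.8, 0.9, 0.2, 0.0, 0.0, 0.0]      # nearly the same population
teapot = [0.0, 0.0, 0.0, 0.7, 0.9, 0.8]    # a disjoint population

print(concept_distance(dog, wolf) < concept_distance(dog, teapot))
```

With this choice, concepts that excite nearly identical populations lie close together and concepts that excite disjoint populations lie far apart, so the set of concepts inherits a geometry from the population code.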

Chapter  4  by  Desimone  et  al.  describes  experiments  in  area  IT  that  fit 
nicely  into  this  theory.  Their  data  suggest  that  perhaps  in  many  cortical 
areas,  there  is  a  tendency  to  form  more  and  more  localized  responses  to 
exactly  repeating  stimuli  (this  is  shown  negatively  by  the  large  numbers  of 
cells  whose  responses  decrease  to  repetitions).  In  other  words,  the  cells  of 
a  certain  class  tend  to  specialize  in  responding  to  very  precise  patterns.  If  a 
category  is  formed  by  a  cluster  of  such  precise  instances,  we  will  naturally 
get  a  graded  population  response  to  new  instances  of  this  category,  because 
it  will  resemble  to  a  greater  and  lesser  degree  each  of  the  previously  learned 
instances. 

One  construct  that  has  often  been  suggested  in  this  connection  is  to 
make  a  graph  out  of  the  set  of  all  concepts,  or  the  set  of  all  English  words. 
The  edges  of  the  graph  represent  the  most  closely  connected  concepts. 
Such  a  graph  was  proposed,  for  instance,  in  Quillian  (1967)  under  the 
name  of  an  associative  net.  Perhaps  the  earliest  attempt  to  do  this  with 
a  whole  language  was  the  Thesaurus  of  Roget,  which  is  precisely  such 
a graph. Bell Labs put this graph "on-line." They found curious facts
such  as  that  the  average  number  of  edges  needed  to  join  a  word  to  its 
antonym  is  5  or  6!  A  quite  curious  graph  is  formed  in  Dixon's  analysis  of 
the  five  word  classes  in  the  Australian  aboriginal  language  Dyirbal.  These 
appear  to  be  clusters  gotten  by  stringing  together  related  concepts  in  long 
chains (cf. Lakoff 1987, pp. 92-102). Once we have this graph, we can
talk about domain warping in situations involving high-level cognitive
data structures and we find that it corresponds to a well-known cognitive
operation: namely finding analogies, finding a mapping between two sets
of  concepts  that  bear  the  same  mutual  relation  to  each  other.  Matching  a 
heavily  shadowed  face  in  front  of  you  with  your  memory  of  general  face 
structure  is  the  same  warping  of  a  template  that  is  accomplished  when  you 
match  your  knowledge  of  the  Pope  with  the  general  concept  of  a  bachelor. 
Our  proposal  is  that  what  happens  neurally  when  you  analyze  the  sentence 
"The  Pope  is  a  bachelor"  (a  classic  example  of  philosophers)  is  that  one 
cortical  area  with  a  "bachelor"  template,  stored  with  all  sorts  of  typical 
properties,  activities,  life  histories  of  bachelors  attempts  to  fit  this  activity 
pattern  to  the  specific  data  conjured  up  in  a  second  area  describing  the 
Pope  and  his  properties,  activities,  and  life  history.  A  partial  match  can  be 
achieved,  after  suitable  warping  of  the  archetype.  This  will  also  highlight 
the  nonmatching  qualities  (e.g.,  the  Pope  does  not  date),  which  is  what  we 
want  to  look  at  next. 

Interruptions  and  Foreground/Background  Coloring 

We  want  to  consider  the  fourth  type  of  deformation  in  pattern  theory,  in¬ 
terruptions.  Recall  that  this  refers  to  the  fact  that  we  are  bombarded  by 
signals  from  many  different  objects  and  events  at  any  given  instant  and  all 
contribute  to  the  activity  being  received  and  processed  by  the  brain.  We 
must  locate  the  boundaries  between  these  objects  or  events,  so  we  can  iden¬ 
tify  them  one  at  a  time.  To  do  this,  we  have  to  label  or  "color"  explicitly 
the  parts  of  the  present  activity  pattern  that  result  from  this  foreground 
object  or  event,  suppressing  for  the  time  being  the  rest  as  background. 
From  the  point  of  view  of  single  cell  activity,  this  is  very  mysterious:  each 
cell  is  population  coded  and,  via  its  collaterals,  there  is  a  tendency  for  a 
spread  of  activation.  What  we  need  is  a  mechanism  to  say  a  and  b  are 
linked  but  NOT  c.  Much  has  been  said  about  this  issue,  under  the  names 
of dynamic linking, compositionality of concepts, etc. In particular, Singer
has  argued  forcefully  for  synchrony  of  pulse  timing  as  a  possible  mecha¬ 
nism  (see  chapter  10).  In  the  context  of  pattern  theory,  the  key  thing  is  that 
whatever  mechanism  is  used,  it  must  involve  correlating  activity  in  reciprocally 
connected areas. This is because only by separating foreground from
background can the features of the foreground be extracted without confusing
them  with  those  of  the  background.  Pattern  theory  proposes  that  this  is 
done iteratively: a preliminary foreground/background separation leads
to  a  preliminary  computation  of  features,  hence  to  a  preliminary  identifi¬ 
cation,  then  by  feedback  a  refined  foreground/background  separation,  etc. 

We  would  like  to  discuss  a  very  simple  specific  case  of  this  problem, 
which  has  been  extensively  studied  in  computer  vision:  the  segmentation 
of  a  two-dimensional  visual  signal  into  distinct  objects.  Our  discussion 
of  the  "Old  Man"  example  (earlier)  shows  that  many  processes  contribute 
to segmentation. (That example dealt with a photograph, hence it omitted
stereo and motion that, in real life, are extremely effective additional
processes  in  segmentation.)  One  of  the  processes  we  discussed  was  the 
linking  of  interrupted  edges  and  the  clustering  of  similarly  textured  blobs, 
with  the  preliminary  goal  of  segmenting  the  image  into  homogeneously 
textured  areas.  Our  hypothesis  is  that  this  segmentation  is  the  main  goal  of 
one or both of the V1 ↔ V2 (↔ V4) feedback loops. Note that in the theory
of  Anderson  and  Van  Essen,  these  are  the  areas  holding  a  pyramid-based 
description  of  the  image;  in  their  terms,  our  hypothesis  is  that  segmenta¬ 
tion  is  the  main  internal  computational  goal  of  this  pyramid  (in  its  loops 
with  higher  areas,  V4  may  participate  in  other  things,  like  the  computation 
of  shape  features  for  identification  of  objects).  Two  quite  different  math¬ 
ematical  discussions  of  the  segmentation  problem  can  be  found  in  Hong 
and  Rosenfeld  (1984),  which  uses  a  pyramid-based  dynamic  linking  algo¬ 
rithm,  and  Lee  et  al.  (1992),  which  uses  Bayesian  methods  of  combining 
edge  and  region  data. 

There  are  two  very  specific  things  to  look  for  if  this  computation  is  going 
on. The first is the need to trace extended edges that surround the objects
in  the  scene.  Simple  Gabor-filter-like  cells  do  not  do  this:  they  are  misled  by 
gaps  in  edges,  small  texture  responses,  blur,  and  local  shadows.  Lateral 
inhibition,  which  is  known  to  occur  for  a  subpopulation  of  complex  cells, 
is  the  first  step  in  finding  the  important  edges,  as  this  will  often  distinguish 
region  boundaries  from  texture  edges.  Filling  gaps  and  finding  alignments 
of  edge  terminators,  as  von  der  Heydt  has  shown  is  done  in  V2  (von  der 
Heydt  and  Peterhans  1989),  is  another  step.  But  all  this  information  must 
be put together. A strong suggestion that the V2 → V1 feedback may be
involved was found recently by Mignard and Malpeli (1991): they found
that vigorous upper layer activity in V1 can be sustained by feedback from
V2 in the absence of direct stimulation from the LGN → V1, layer 4 pathway.
It is possible that V2 → V1 paths carry a reconstruction of the extended
edges in an image, which are then compared with the detailed local signal
by the pyramidal cells in layers 2 and 3 of V1, resulting in a new refined
signal  of  edge  strength  going  back  to  V2,  where  it  is  linked  up  further 
into  larger  edges,  etc.  Algorithms  to  do  this  in  a  computer  have  been 
extensively  studied  both  by  our  students  (cf.  Nitzberg  et  al.  1993)  and 
those  of  Zucker  (cf.  David  and  Zucker  1990). 

However,  correctly  tracing  extended  edges  is  only  one  part  of  the  prob¬ 
lem.  The  other  is  to  "color"  a  region  that  is  surrounded  by  such  a  contour, 
that  is,  marking  explicitly  homogeneous  areas  not  interrupted  by  strong 
edges.  Until  a  region  is  so  colored,  there  is  no  way  to  compute  the  features 
of  its  shape,  such  as  its  center,  its  area  and  orientation,  etc.,  hence  to  begin  an 
identification  procedure.  The  most  dramatic  evidence  that  such  an  active 
"coloring"  process  does  take  place  in  the  cortex  is  the  experiments  on  mask¬ 
ing  of  Nakayama  and  Paradiso  (1991).  Masking  seems  to  freeze  the  pro¬ 
cessing  at  an  intermediate  stage  and  they  find  partial  stages  at  which  the  ho¬ 
mogeneity  of  part  of  a  region  has  been  made  explicit,  but  not  the  whole.  The 
underlying neural activity expressed in this coloring process might take
place in the cytochrome-oxidase blobs, especially if some mechanism for
dynamically  linking  those  blob  cells  that  are  responding  to  two  parts  of  the 
same  object  were  found.  Coloring  might  mean,  for  instance,  progressive 
entrainment  of  larger  and  larger  populations  of  cells  in  synchronized  firing. 
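
Purely as a computational illustration of what "coloring" buys, and not as a model of the neural mechanism, the sketch below fills the region enclosed by a contour and only then computes simple shape features. The binary edge map, the 4-neighbor fill, and the function names are assumptions made for this example; orientation is omitted for brevity.

```python
from collections import deque

def color_region(edges, seed):
    """Return the set of (row, col) pixels reachable from seed
    without crossing an edge pixel (4-neighbor flood fill)."""
    rows, cols = len(edges), len(edges[0])
    region = set()
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        if (r, c) in region or not (0 <= r < rows and 0 <= c < cols):
            continue
        if edges[r][c]:          # strong edge: the coloring stops here
            continue
        region.add((r, c))
        queue.extend([(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)])
    return region

def shape_features(region):
    """Area and center of a colored region; these are undefined
    until the region has been made explicit."""
    area = len(region)
    center_row = sum(r for r, _ in region) / area
    center_col = sum(c for _, c in region) / area
    return area, (center_row, center_col)

# Hypothetical edge map: 1 marks a contour pixel, 0 marks interior or exterior.
edges = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]
region = color_region(edges, seed=(2, 2))
area, center = shape_features(region)
```

A gap in the contour would let the fill leak into the surround, which is one way to see why correct edge tracing and coloring have to be solved together.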

From a computational point of view, it is very important to realize that
coloring is not a simple mechanical step (as it seems in artificially simplified
stimuli). In real images it requires adaptively determining what
homogeneous means: what matters is that the stimulus within the
cell's receptive field is relatively homogeneous compared to variations in a
larger surround. Coloring therefore cannot be done by purely local computations.
Figure 7.6, for example, shows two images on top of which we have
drawn a dotted circle to represent the classical receptive field of a V1 neuron.
In these images, the interiors of the dotted circles are identical, hence
the V1 neuron "sees" the same blurry contour. In one, however, the
blurry central contour is the perceptual boundary of a foreground object in
front of a background; in the other, the blurry contour is merely a shading
effect on the surface of a different object. In the first case, the central region
is not homogeneous; in the second it is. Thus we predict that at least some
V1 neurons with this receptive field would exhibit modulation from outside
their classical receptive field that reflects this difference. Whether this
modulation  was  excitatory  or  inhibitory  would  depend  on  whether  the 
local  evidence  for  an  edge  was  strengthened  or  weakened  by  the  global 
evidence  (as  in  figure  7.6).  It  might  also  have  a  longer  latency  than  the  local 
response  (e.g.,  this  modulation  might  take  effect  50  msec  after  the  initial 
response).  We  expect  that  this  modulation  is  a  typical  effect  of  feedback 
from  V2,  where  the  larger  receptive  fields  allow  more  global  integration  of 
the  percept. 

Another  hypothesis  for  the  marking  of  object  boundaries  and  inhibiting 
the  sideways  spread  of  activity  was  made  by  Somogyi  and  Cowey  (1981). 
They  hypothesized  that  "curtains"  of  inhibitory  double-bouquet  cells  may 
activate,  cutting  off  activity  in  vertical  columns  from  neighboring  columns 
on  the  other  side  of  this  curtain,  thus  allowing  integration  of  activity  within 
the  population  of  cells  responding  to  one  portion  of  the  visible  field,  but 
preventing  this  from  interfering  with  activity  related  to  other  parts  of  the 
field.  This  could  have  a  similar  effect  in  dynamically  linking  cell  popula¬ 
tions  as  pulse  synchrony. 

SPATIOTEMPORAL  PATTERNS  AND  TEMPORAL  BUFFERING 

There  is  a  strong  tendency  in  analyzing  cognition  to  regard  space  and  time 
as  two  quite  different  things.  From  the  point  of  view  of  Pattern  Theory, 
however,  the  signals  received  by  the  brain  are  functions  of  both  space  and 
time  and  they  exhibit  patterns  in  both  dimensions.  All  of  the  characteristic 
deformations  present  in  spatially  distributed  patterns  are  present  in  tem¬ 
porally  distributed  ones  and  in  signals  depending  on  both  space  and  time. 
The input to the eyes is a function I(x, y, t) of two spatial and one temporal
variable; the output of the cochlea is a filtered function s(ω, t) of frequency
and time; the signal from the proprioceptive system is a function m(k, t)
indicating the stretch and tension of the kth muscle at time t. In this section,
we want to examine the specific problems of computing temporal patterns
in signals.

Figure 7.6 Two stimuli, identical locally within a receptive field (indicated by dotted line),
differing globally. On the left, a blurry figure on a black ground, its edge within the receptive
field; on the right, a single shaded figure on a textured ground, its edge outside the receptive
field.

In  vision,  we  often  make  the  assumption  that  after  initial  temporal  fil¬ 
tering  by  the  ganglion  and  amacrine  cells  in  the  retina,  the  remainder  of 
the  visual  system  is  presented  with  an  instantaneous  representation  of  the 
image  and  its  optical  flow,  which  can  be  analyzed  as  a  fixed  signal.  Observ¬ 
ing  experimentally  the  modulation  of  a  response  to  time  varying  patterns 
is  difficult  because  of  the  apparently  stochastic  nature  of  spiking,  which 
requires  averaging  the  cell's  response  for  as  long  as  possible.  The  stan¬ 
dard  experimental  approach  has  been  finding  a  way  of  keeping  up  a  fixed 
optimal  stimulus  for  as  long  as  possible,  "tickling"  the  cell  as  it  were.  In 
the  visual  system,  this  leads  the  experimenter  to  prefer  repetitively  and 
constantly  moving  stimuli  and  this  prevents  one  from  analyzing  the  de¬ 
pendence  of  the  response  on  subtler  temporal  variations  of  the  signal. 

None  of  this  addresses  an  obvious  aspect  of  natural  stimuli:  in  general, 
these  are  neither  still  nor  moving  regularly.  Natural  stimuli  often  move 
and  change  in  complex  ways  that  are  essential  for  the  proper  identification 
of  their  source.  A  simple  example  is  the  identification  of  people  through 
characteristic  gestures  and  fleeting  expressions:  it  is  as  though  we  preserve 
movie  clips  of  typical  things  our  friends  do,  and  can  match  this  memory 
against  the  fleeting  temporal  signal  that  we  receive.  Likewise,  it  is  well 
known that the recognition of phonemes cannot be done successfully from
the analysis of speech at any single instant, but requires the integration of
clues hidden in the preceding and succeeding phonemes as well (Liberman
1982). All of these tasks require temporal buffering: the temporary storage of
the  sensory  signal  or  its  features  while  the  remainder  of  the  signal  continues 
to  unfold.  To  model  this  will  require  neural  mechanisms  that,  as  far  as  we 
know,  have  not  yet  been  described  and  to  find  these  mechanisms  will 
require  the  presentation  and  analysis  of  responses  to  more  complex  time 
varying  signals  than  have  been  studied  as  yet. 

A  specific  case  is  the  LGN  and  the  motion  pathway  (called  magnocellular 
or  M  in  the  monkey,  and  Y  pathway  in  the  cat).  During  a  single  fixation  of 
the  eyes,  a  small  moving  object  may  stimulate  many  ganglion  cells  in  the 
M  pathway  as  its  image  crosses  the  retina.  Often,  we  may  want  our  eyes  to 
make  a  transition  to  pursuit  mode,  following  the  object  to  "freeze"  its  image 
on  the  retina.  To  do  so  requires  that  we  predict  where  the  object  will  be  after 
the  next  100  msec  or  so,  hence  that  we  have  an  accurate  record  of  where 
the  object  was  in  the  last  100  msec.  Since  M  cells  are  very  transient,  some 
mechanism  is  needed  to  sustain  activity  until  the  end  of  the  fixation,  while 
its  velocity  is  being  calculated.  We  would  like  to  make  the  hypothesis  that, 
at least in the cat, the LGN Y pathway cells are used for this temporal buffering,
their activity being sustained by corticothalamic feedback after the moving object
passes their receptive field. This possibility is suggested by the cell counts in
the  cat  LGN  that  show  that  there  are  about  12  times  as  many  LGN  cells 
in  its  Y  pathway  as  there  are  retinal  ganglion  cells  (Sherman  and  Koch 
1986).  Such  a  population  could  encode  the  time  history  of  the  stimulus  in 
many  ways.  It  could  store  a  sequence  of  activity  states  in  different  cells; 
more  likely,  the  cells  might  population  code  this  history,  or  features  of  this 
history,  like  acceleration,  stops,  and  starts. 
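
To make the computational requirement concrete, here is a minimal sketch of why a buffered record is needed for prediction: the velocity estimate uses positions from the recent past, which must still be available when the prediction is made. The constant-velocity assumption, the sample values, and the function name are ours for illustration; nothing here is meant to describe how LGN cells actually encode the history.

```python
def predict_position(history, ms_ahead):
    """Extrapolate position assuming constant velocity.

    history: list of (time_ms, position) samples held in the buffer,
             oldest first. At least two samples are required.
    ms_ahead: how far into the future to predict, in msec.
    """
    (t_old, x_old), (t_new, x_new) = history[0], history[-1]
    velocity = (x_new - x_old) / (t_new - t_old)
    return x_new + velocity * ms_ahead

# Hypothetical record of the last 100 msec (position in degrees of visual angle).
buffer = [(0, 0.0), (50, 1.0), (100, 2.0)]
predicted = predict_position(buffer, ms_ahead=100)
```

With only the most recent sample and no buffer, the velocity, and hence the prediction, could not be computed at all.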

Other prime candidates for detecting temporal buffering are A1 and M1.
In both areas, it seems essential to buffer the temporal activity pattern (i.e.,
the auditory signal over something like the last 200 msec, or the motor
commands over the next 200 msec). In A1, this should be especially simple
to check: one needs to record and analyze responses to pairs of sounds,
presented sequentially. The null hypothesis, that there is no buffering,
would imply that the response to the second part of the stimulus is independent
of the first part of the stimulus. Temporal buffering would predict
some kind of modulation of the second response. As far as we know, a
neurophysiological experiment to look for this kind of buffering has not
been done.

LEARNING  THE  HIDDEN  VARIABLES  AND  THEIR  PRIORS  VIA 
MINIMUM  DESCRIPTION  LENGTH 

Bayesian  statistics  was  one  of  the  main  inspirations  for  Pattern  Theory.  It 
goes  like  this:  assume  that  X  is  a  set  of  variables  describing  the  world — 
called  the  hidden  variables — and  that  Y  is  the  data  we  observe.  We  assume, 
moreover, that from experience we know the "prior" probability pr(X = x)
[or pr(x) for short] that the variables X take on every possible set of values
x (e.g., you know it is very unlikely that your grandmother is wearing a
bikini), and that we also know the conditional probability of every possible
observation y given the state of the world x, written pr(Y = y|X = x) [or
pr(y|x) for short]. Then if we have observations y, we will want to estimate
the most likely a posteriori values x of the hidden variables describing the
world. Bayes says to do this by finding the x that makes the conditional
probability pr(x|y) the largest, which by Bayes's theorem is the x that maximizes
[pr(y|x) • pr(x)]. (So if we think we see Granny in a bikini at a great
distance,  we  reject  the  conclusion,  but  if  we  see  her  so  attired  close  up,  we 
have  to  accept  it  as  fact.)  The  optimal  value  x  so  calculated  is  called  the 
maximum a posteriori or "MAP" estimate of the world variables. This is
standard  stuff. 
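
The Granny example can be put into numbers with a small sketch of the MAP rule. All probabilities below are invented for illustration, as are the state and observation names; the only thing taken from the text is the rule itself, namely choosing the x that maximizes pr(y|x) • pr(x).

```python
def map_estimate(prior, likelihood, y):
    """Return the world state x that maximizes pr(y|x) * pr(x)."""
    return max(prior, key=lambda x: likelihood[x][y] * prior[x])

# Assumed prior: a bikini is very unlikely.
prior = {"bikini": 0.001, "dress": 0.999}

# Assumed likelihoods of the observation "looks like a bikini".
# At a great distance the observation is unreliable; close up it is not.
far_away = {
    "bikini": {"looks_like_bikini": 0.6},
    "dress": {"looks_like_bikini": 0.05},
}
close_up = {
    "bikini": {"looks_like_bikini": 0.99},
    "dress": {"looks_like_bikini": 0.0001},
}

far_guess = map_estimate(prior, far_away, "looks_like_bikini")
near_guess = map_estimate(prior, close_up, "looks_like_bikini")
```

Under these assumed numbers the distant observation is overruled by the prior, while the close-up observation is strong enough to overcome it.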

To  use  this  rule,  one  needs  to  learn,  store,  and  apply  via  Bayes's  rule 
both  the  prior  probability  distribution  on  the  world  variables  X  and  the 
conditional  probability  on  the  observations  Y  given  X.  In  a  biological 
setting,  it  is  possible  to  imagine  that  these  probability  distributions  were 
somehow  learned  by  natural  selection  and  have  become  encoded  into  the 
genes.  Perhaps  this  happens  with  some  animals — for  instance  the  overall 
structure  of  a  bird's  song  seems  to  be  genetically  encoded — but  this  does 
not seem to account for the flexibility of mammalian responses. For instance,
a human infant born into a complex technological culture has no
trouble  learning  how  to  use  TV  sets.  There  are  various  approaches  to  learn¬ 
ing  these  probability  distributions  "on  the  fly,"  but  one  that  fits  in  cleanly 
with  both  Bayesian  statistics  and  Pattern  Theory  is  to  use  the  Minimum 
Description  Length  Principle.  This  approach  is  particularly  attractive  in 
that  it  suggests  how  the  world  variables  X  themselves  might  be  learned, 
not  merely  their  distribution. 

The  Minimum  Description  Length  (or  MDL)  Principle  says  that,  starting 
with many observations Y = y_n, you may take advantage of the patterns
and  repetitions  in  this  string  of  observations  to  reencode  Y  so  that,  with 
high  probability,  if  every  new  observation  is  reencoded  in  this  way,  it  will 
have  much  shorter  length  (in  bits).  For  example,  suppose  five  different 
bird  songs  are  heard  regularly  in  your  back  yard.  You  can  assign  a  short 
distinctive  code  to  each  such  song,  so  that  instead  of  having  to  remember 
the  whole  song  from  scratch  each  time,  you  just  say  to  yourself  something 
like  "Aha,  song  #3  again."  Note  that  in  doing  so,  you  have  automatically 
learned  a  world  variable  at  the  same  time:  the  number  or  code  you  use  for 
each  song  is,  in  effect,  a  name  for  a  species,  and  you  have  rediscovered  a 
bit  of  Linnaean  biology.  Moreover,  if  one  bird  is  the  most  frequent  singer, 
you  will  probably  use  the  shortest  code  (e.g.,  "song  #1")  for  that  bird.  In 
this  way,  you  are  also  learning  the  probability  of  different  values  for  the 
variable  "song  #x."  This  is  nothing  more  than  the  fundamental  theorem 
of  Shannon's  information  theory  that  provides  the  link  between  coding 
length  and  probabilities.  His  theorem  states  that  if  you  want  to  encode  the 
different values x of variables X so that the average length of the code is
smallest, then the length of the code c(x) in bits will be

c(x) = -log2[pr(X = x)].

(A  technical  point:  in  this  formula,  the  log  is  a  positive  real  number  that 
need  not  be  an  integer.  But  the  number  of  bits  in  a  code  is  always  an 
integer.  So  what  Shannon  did,  to  get  this  elegant  relationship,  was  to 
consider  "block  coding,"  codes  where  several  signals  were  encoded  at 
once. If k signals were encoded, the code length for each signal is 1/k
times  the  length  of  the  block  code.  Then  the  exact  theorem  states  that  by 
considering  longer  and  longer  block  codes,  the  left  hand  side  gets  as  close 
as  you  want  to  the  right.) 
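
The bird song example can be worked through numerically. The sketch below assumes five songs with made-up frequencies that happen to be powers of one half, so that the ideal code lengths are whole numbers and no block coding is needed; the song names and probabilities are illustrative assumptions only.

```python
import math

def code_lengths(probs):
    """Ideal code length in bits for each value: c(x) = -log2 pr(x)."""
    return {x: -math.log2(p) for x, p in probs.items()}

def average_length(probs):
    """Expected code length per observation, in bits."""
    return sum(p * -math.log2(p) for p in probs.values())

# Assumed frequencies of the five back-yard songs.
songs = {"song1": 0.5, "song2": 0.25, "song3": 0.125,
         "song4": 0.0625, "song5": 0.0625}

lengths = code_lengths(songs)        # the most frequent song gets the shortest code
average = average_length(songs)      # bits per song under the ideal code
fixed = math.ceil(math.log2(len(songs)))  # bits per song if every code had equal length
```

Here the most frequent singer receives a 1-bit code and the rarest a 4-bit code, and the average comes to 1.875 bits per song against 3 bits for a fixed-length code, so learning the code is the same thing as learning the probabilities.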

How could finding the MAP estimate be implemented cortically? The
natural hypothesis is that the prior probabilities pr(x) of each set of values x of the
hidden world variables and the conditional probabilities pr(y|x) of making
an observation are stored in the mechanism for pattern synthesis, so that there is a
tendency  to  synthesize  the  most  likely  patterns  first,  the  less  likely  coming 
to  the  fore  only  if  the  more  likely  ones  are  inhibited  by  mismatch  with  the 
input  (as  in  Carpenter  and  Grossberg  1987).  For  instance,  when  a  pattern 
is  synthesized  to  imitate  a  new  signal,  the  most  likely  values  might  be 
chosen  by  some  summation  of  activation  proportional  to  log[pr(x,  y)]  (see 
Lee  1992).  In  terms  of  MDL,  we  can  say  that  the  higher  level  cortical  area 
somehow  seeks  the  most  economical  way,  the  simplest  pattern  of  firing, 
that  will  generate  a  top-down  synthesized  signal  close  to  the  true  sensory 
signal. 

I  would  like  to  give  a  more  elaborate  example  to  show  how  MDL  can 
lead  you  to  the  correct  variables  with  which  to  describe  the  world  using 
an  old  and  familiar  vision  problem:  the  stereo  correspondence  problem. 
The usual approach to stereo vision is to apply our knowledge of the three-dimensional
structure of the world to show how matching the images I_L
and I_R from the left and right eyes leads us to a reconstruction of depth
through the "disparity function" d(x, y) such that I_L(x + d(x, y), y) is approximately
equal to I_R(x, y). In doing so, most algorithms take into account
the "constraint" that most surfaces in the world are smooth, so that depth
and disparity vary slowly as we scan across an image. The MDL approach
is quite different. First, the raw perceptual signal comes as two sets of N
pixel values I_L(x, y) and I_R(x, y), each encoded up to some fixed accuracy
by d bits (here d is the number of bits per pixel, not the disparity function
d(x, y)), totaling 2 • d • N bits. But the attentive encoder notices how often
pieces of the left image code nearly duplicate pieces of the right code: this
is a common pattern that cries out for use in shrinking the code length. So
we are led to code the signal in three pieces: first the raw left image I_L(x, y),
then the disparity d(x, y), and finally the residual I_R(x, y) - I_L(x + d(x, y), y).
The disparity and the residual are both quite small, so instead of d bits,
these may need only a small number e and f bits, respectively. Provided d
> e + f, we have saved bits. In fact, if we use the constraint that surfaces are
mostly smooth, so that d(x, y) varies slowly, we can further encode d(x, y) by
its average value d_0(y) on each horizontal line and its x-derivative d_x(x, y),
which is mostly much smaller. The important point is that MDL coding
leads you to introduce the third coordinate of space, that is, to discover three-dimensional
space! A further study of the discontinuities in d, and the
"nonmatching"  pixels  visible  to  one  eye  only  goes  further  and  leads  you  to 
invent  a  description  of  the  image  containing  labels  for  distinct  objects,  that 
is,  to  discover  that  the  world  is  usually  made  up  of  discrete  objects.  For  a  more 
complete  discussion,  see  Mumford  (1993,  §5d). 
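
The bit counting in the stereo argument can be checked on a toy case. The sketch below builds a single synthetic scan line in which the right image is a shifted copy of the left plus small noise, then compares the raw cost with the cost of coding the left image, the disparity, and the residual. The fixed-length "bits needed to cover the observed range" measure, the constant disparity, the noise model, and all names are simplifying assumptions for illustration; this is not the coding scheme of Mumford (1993).

```python
import math
import random

def bits_needed(values):
    """Bits per value for a fixed-length code covering the observed range."""
    span = max(values) - min(values) + 1
    return max(1, math.ceil(math.log2(span)))

random.seed(0)
n_pixels = 200
true_shift = 3   # assumed constant disparity along the scan line

# Left scan line: arbitrary 8-bit intensities (a few extra pixels for the shift).
left = [random.randrange(256) for _ in range(n_pixels + true_shift)]
# Right scan line: the left one displaced by the disparity, plus small noise.
right = [left[x + true_shift] + random.choice([-1, 0, 1]) for x in range(n_pixels)]

disparity = [true_shift] * n_pixels
residual = [right[x] - left[x + disparity[x]] for x in range(n_pixels)]

left_cost = bits_needed(left) * len(left)
raw_bits = left_cost + bits_needed(right) * n_pixels
coded_bits = (left_cost
              + bits_needed(disparity) * n_pixels    # e bits per pixel
              + bits_needed(residual) * n_pixels)    # f bits per pixel
```

In this toy case e + f is at most 3 bits per pixel, far below the bits per pixel needed for the raw right image, so `coded_bits` is smaller than `raw_bits`; the saving exists only because the encoder has introduced the disparity as a new variable.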

Can  the  learning  phase  of  MDL  be  implemented  in  a  natural  way  in 
cortex?  We  think  this  is  one  of  the  most  interesting  challenges  to  Pattern 
Theory.  We  have  no  proposal  except  to  say  that  recent  work  (Intrator 
1992;  Jordan  and  Jacobs  1993;  Hinton,  unpublished  observations)  shows 
that  many  learning  rules,  more  complex  than  simple  Hebbian  learning,  are 
possible  and  suggestive.  Hinton's,  especially,  looks  like  it  might  solve  the 
stereo  problem  along  the  lines  proposed  above. 

SUMMARY 

Starting  from  the  theoretical  perspective  of  Pattern  Theory,  this  chapter  has 
made  some  specific  proposals  for  the  data  structures  and  computational 
mechanisms  to  be  expected  in  the  cortex.  These  include  (1)  the  need  for 
feedback loops activating template-like patterns in lower cortical areas,
(2)  a  mechanism  for  shifting  or  warping  patterns  of  cortical  activity,  (3) 
marking  both  boundaries  between  unrelated  features  and  the  complexes  of 
related  activity  with  a  common  source,  (4)  the  need  for  temporal  buffering, 
(5)  multiscale  population  coded  representations,  and  (6)  the  possibility 
that  the  Minimum  Description  Length  Principle  can  be  used  as  a  basis  of 
learning  world  structures. 

A  common  thread  in  all  the  specific  proposals  above  is  the  need  for 
more  sophisticated  experimental  stimuli,  motivated  by  computational  or 
psychological  theory.  A  well-known  experimenter  laughed  at  me  10  years 
ago  when  we  suggested  that  one  should  look  for  cell  responses  in  higher 
visual  areas  correlated  to  global  features  of  the  image  outside  its  "classical" 
receptive  field.  Shortly  thereafter,  von  der  Heydt's  experiments  provided 
the  first  dramatic  proof  that  this  occurs  (von  der  Heydt  et  al.  1984).  Real 
world  stimuli  have  a  huge  number  of  complexities  and  subtleties  not  even 
remotely  present  in  typical  laboratory  stimuli  and  these  should  be  studied, 
one  at  a  time,  to  see  how  the  cortex  handles  them.  For  example,  one  can 
present  edges  that  are  blurred  or  noisy,  curved  or  interrupted,  embedded 
in  textures  or  with  contrast  reversals.  One  can  use  complex  temporal  orga¬ 
nization,  comparing  an  extended  continuous  movement  with  many  small 
movements  that  flicker  off.  Two  general  paradigms  suggest  themselves: 
one  is  to  use  pairs  of  stimuli  that  are  locally  identical,  but  globally  quite 
different.  In  this  case,  the  higher  cortical  area  can  respond  to  the  larger 
features  and  so  modulate  the  responses  of  a  cell  in  the  lower  area  to  two 
stimuli  identical  within  its  receptive  field.  The  second  is  really  a  special 
case of this: to present stimuli that are neutral locally, not stimulating a
particular cell, but that have major global organization that may imply
local  structure,  and  see  if  it  affects  the  original  cell. 

A  second  thread  is  the  need  to  consider  feedback  effects  when  modeling 
cortical  responses.  Our  observation  is  that  there  is  a  strong  bias  toward 
seeking  simple  feedforward  explanations  of  what  the  cortex  is  doing.  For 
instance,  Marr's  book  (Marr  1982)  is  essentially  a  purely  feedforward  the¬ 
ory  of  vision.  If  any  of  the  above  theorizing  is  half  right,  feedback  plays  a 
major  role  in  both  low-  and  high-level  processing  and  cannot  be  ignored, 
even in primary sensory and motor areas.

Observations on Cortical Mechanisms for
Object  Recognition  and  Learning 

Tomaso  A.  Poggio  and  Anya  Hurlbert 


INTRODUCTION 

One  of  the  main  goals  of  vision  is  object  recognition.  But  there  may  be 
many  distinct  routes  to  this  goal  and  the  goal  itself  may  come  in  several 
forms.  Anyone  who  has  struggled  to  identify  a  particular  amoeba  swim¬ 
ming  on  a  microscope  slide  or  to  distinguish  between  novel  visual  stimuli 
in  a  psychophysics  laboratory  might  admit  that  recognizing  a  familiar  face 
seems  an  altogether  different  and  simpler  task.  Recent  evidence  from  sev¬ 
eral  lines  of  research  strongly  suggests  that  not  all  recognition  tasks  are 
the  same.  Psychophysical  results  and  computational  analyses  suggest  that 
recognition  strategies  may  depend  on  the  type  of  both  object  and  visual 
task.  Symmetric  objects  are  better  recognized  from  novel  viewpoints  than 
asymmetric objects (Poggio and Vetter 1992); when moved to novel locations
in the visual field, objects with translation-invariant features are
better recognized than those without (Nazir and O'Regan 1990; Bricolo
and Bülthoff 1992). A typical agnosic patient can distinguish between a
face  and  a  car,  a  classification  task  at  the  basic  level  of  recognition,  but 
cannot  recognize  the  face  of  Marilyn  Monroe,  an  identification  task  at  the 
subordinate  level  (Damasio  et  al.  1990b).  A  recently  reported  stroke  patient 
cannot  identify  the  orientation  of  a  line  but  can  align  her  hand  with  it  if 
she  imagines  posting  a  letter  through  it,  suggesting  strongly  that  there  are 
also  multiple  outputs  from  visual  recognition  (Goodale  et  al.  1991). 

Yet  although  recognition  strategies  diverge,  recent  theories  of  object 
recognition  converge  on  one  mechanism  that  might  underlie  several  of 
the  distinct  stages,  as  we  will  argue  in  this  chapter.  This  mechanism  is 
a  simple  one,  closely  related  to  template  matching  and  nearest  neighbor 
techniques.  It  belongs  to  a  class  of  explanations  that  we  call  memory- 
based  models  (MBMs),  which  includes  memory-based  recognition,  sparse 
population  coding,  generalized  radial  basis  functions  networks,  and  their 
extension,  hyper  basis  functions  (HBF)  networks  (Poggio  and  Girosi  1990b) 
(figure  8.1).  In  MBMs,  classification  or  identification  of  a  visual  stimulus 
is  accomplished  by  a  network  of  units.  Each  unit  is  broadly  tuned  to  a 
particular template, so that it is maximally excited when the stimulus exactly
matches its template but also responds proportionately less to similar
stimuli. The weighted sum of activities of all the units uniquely labels a
novel stimulus. Several recent and successful face recognition schemes for
machine vision share aspects of this framework (Baron 1981; Bichsel 1991;
Brunelli and Poggio 1992; Turk and Pentland 1991; Stringa 1992a,b).

Figure 8.1 An RBF network for the approximation of two-dimensional functions (left) and
its basic "hidden" unit (right). x and y are components of the input vector, which is compared
via the RBF h at each center t. Outputs of the RBFs are weighted by the c_i and summed to
yield the function F evaluated at the input vector. N is the total number of centers.

We  will  consider  how  the  basic  features  of  this  class  of  models  might  be 
implemented  by  the  human  visual  system.  Our  aim  is  to  demonstrate  that 
such  models  conform  to  existing  physiological  data  and  to  make  further 
physiological  predictions.  We  will  use  as  a  specific  example  of  the  class 
the  HBF  network.  HBF  networks  have  been  used  successfully  to  solve 
isolated  visual  tasks,  such  as  learning  to  detect  displacements  at  hyper¬ 
acuity  resolution  (Poggio  et  al.  1992)  or  learning  to  identify  the  gender  of 
a face (Brunelli and Poggio 1992). We will discuss how the units of an HBF
network  might  be  realized  as  neurons  and  how  a  similar  network  might 
be  implemented  by  cortical  circuitry  and  replicated  at  many  levels  to  per¬ 
form  the  multicomponent  task  of  visual  recognition.  We  hope  to  show  that 
MBMs  are  not  merely  toy  replicas  of  neural  systems,  but  viable  models  that 
make  testable  biological  predictions. 

The  main  predictions  of  memory-based  models  are  as  follows: 

• The existence of broadly tuned neurons at all levels of the visual pathway,
tuned to single features or to configurations in a multidimensional feature
space.

•  At  least  two  types  of  plasticity  in  the  adult  brain,  corresponding  to  two 
stages  of  learning  in  perceptual  skills  and  tasks.  One  stage  probably  in¬ 
volves  changes  in  the  tuning  of  individual  neuron  responses;  this  resembles 
adaptation.  The  other  probably  requires  changes  in  cortical  circuitry  spe¬ 
cific  to  the  task  being  learned,  connecting  many  neurons  across  possibly 
many  areas. 

OBJECT  RECOGNITION:  MULTIPLE  TASKS,  MULTIPLE  PATHWAYS 

Recognizing  an  object  should  be  difficult  because  it  rarely  looks  the  same  on 
each  sighting.  Consider  the  prototypical  problem  of  recognizing  a  specific 
face.  (We  believe  that  processing  of  faces  is  not  qualitatively  different  from 
processing  of  other  3D  objects,  although  the  former  might  be  streamlined 
by practice, and biological evidence supports this view [Gross 1992].) The
2D  retinal  image  formed  by  the  face  changes  with  the  observer's  viewpoint, 
and  with  the  many  transformations  that  the  face  can  undergo:  changes  in 
its  location,  pose,  and  illumination,  as  well  as  nonrigid  deformations  such 
as  the  transition  from  a  smile  to  a  frown.  A  successful  recognition  system 
must  be  robust  under  all  such  transformations. 

Here  we  outline  an  architecture  for  a  recognition  system  that  contains 
what  we  believe  are  the  rudimentary  elements  of  a  robust  system.  It  is  best 
considered  as  a  protocol  for  and  summary  of  existing  programs  in  machine 
vision,  but  it  also  represents  an  attempt  to  delineate  the  stages  probably 
involved  in  visual  recognition  by  humans.  The  scheme  (diagrammed  in 
figure  8.2)  has  dual  routes  to  recognition.  The  first  is  a  streamlined  route 
to  recognition  in  which  the  features  extracted  in  the  early  stages  of  im¬ 
age  analysis  are  matched  directly  to  samples  in  the  database.  The  second 
potential  route  to  recognition  diverges  from  the  first  to  allow  for  the  pos¬ 
sibility  that  both  the  database  models  and  the  extracted  image  features 
might  need  further  processing  before  a  match  can  be  found. 

Our  task  in  recognizing  a  face — or  any  other  3D  object — consists  of  mul¬ 
tiple  tasks,  which  fall  into  three  broad  categories  that  characterize  both 
routes: 

• Segmentation: Marking the boundaries of the face in the image. This
stage  typically  involves  segmenting  the  entire  image  into  regions  likely 
to  correspond  to  different  materials  or  surfaces  (and  thereby  subsumes 
figure-ground  segregation)  and  is  a  prerequisite  for  further  analysis  of  a 
marked  region.  Image  measurements  are  used  to  convert  the  retinal  array 
of  light  intensities  into  a  primal  image  representation,  by  computing  sparse 
measurements  on  the  array,  such  as  intensity  gradients,  or  center-surround 
outputs.  The  result  is  a  set  of  vector  measurements  at  each  of  a  sparse  or 
dense  set  of  locations  in  the  image.  Features  may  then  be  found,  and  used 
to partition the image.

Figure 8.2 A sketch of an architecture for recognition with two hypothetical routes to
recognition. Single arrows represent the classification and indexing route described in appendix
A. Double arrows represent the main visualization route, and dashed arrows represent
alternative pathways within it.

• Classification, or basic-level recognition: Distinguishing objects that are
faces from those that are not. Parameter values estimated in the preceding
stage (e.g., the distance between eyes and mouth) are used in this stage
for classification of a set of features (e.g., as a potential face, animal, or
man-made  tool).  This  stage  requires  that  the  boundaries  or  the  location  of 
at  least  potential  faces  be  demarcated,  and  hence  generally  depends  on  the 
preceding  step  of  image  segmentation,  although  it  may  work  without  it  at 
an  added  computational  cost. 

•  Identification,  or  subordinate-level  recognition:  Matching  the  face  to  a 
stored  memory,  and  thereby  labeling  it.  This  stage  requires  some  form  of 
indexing  of  the  database  samples.  Because  it  is  computationally  implausi¬ 
ble  that  the  recognition  system  contains  a  stored  sample  of  the  face  in  each 
of  its  possible  views  or  expressions,  or  under  all  possible  illumination  con¬ 
ditions  at  all  possible  viewing  distances,  this  step  in  general  also  requires 
that  the  face  be  transformed  into  a  standard  form  for  matching  against  its 
stored  template.  Thus  in  parallel  with  the  direct  route  from  classification 
to  identification  there  exists  a  second  route  that  we  call  the  visualization 
route,  which  may  include  an  iterative  sequence  of  transformations  of  both 
the  image-plane  and  the  database  models  until  it  converges  on  a  match. 

These  stages,  and  some  open  questions  on  the  overall  architecture,  are 
discussed  further  in  appendix  A. 
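
The two routes to identification described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration rather than part of the proposed architecture: images and database samples are short tuples of feature values, the only transformation considered is a circular shift (a stand-in for translation), and only the image side is transformed, whereas the text allows transformations of both the image and the database models. The function names `direct_route`, `visualization_route`, and `recognize` are hypothetical.

```python
def distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def direct_route(image, database, threshold):
    """Match extracted features directly against the stored samples."""
    label, best = min(((name, distance(image, sample))
                       for name, sample in database.items()),
                      key=lambda pair: pair[1])
    return label if best <= threshold else None

def shifts(pattern):
    """Hypothetical transformation set: all circular shifts of a 1D pattern."""
    return [pattern[k:] + pattern[:k] for k in range(len(pattern))]

def visualization_route(image, database, threshold):
    """Transform the image step by step until some stored sample matches."""
    for candidate in shifts(image):
        label = direct_route(candidate, database, threshold)
        if label is not None:
            return label
    return None

def recognize(image, database, threshold=0.1):
    """Try the streamlined route first, then fall back on visualization."""
    label = direct_route(image, database, threshold)
    if label is None:
        label = visualization_route(image, database, threshold)
    return label

# Hypothetical database with one stored sample per object.
database = {"A": (1.0, 0.0, 0.0, 0.5), "B": (0.0, 1.0, 1.0, 0.0)}

seen_directly = recognize((1.0, 0.05, 0.0, 0.5), database)  # close to stored A
seen_shifted = recognize((0.0, 0.5, 1.0, 0.0), database)    # A shifted by two
unknown = recognize((0.3, 0.3, 0.3, 0.3), database)         # matches nothing
```

In this sketch the shifted input fails the direct match and is recovered only after the search over transformations, which is the extra computational cost that the visualization route is meant to absorb.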

As  outlined  here,  the  stages  are  distinct  and  could  be  implemented  in 
series  within  each  route  to  recognition.  Most  artificial  face  recognition 
systems  tackle  the  stages  separately,  being  designed  either  to  detect  and 
localize  a  face  in  an  image  cluttered  with  other  objects  (segmentation  and 
classification),  or  to  identify  individual  faces  presented  in  an  expected  for¬ 
mat  (database  indexing  and  identification).  Some  artificial  recognition 
systems  have  been  constructed  to  achieve  invariant  recognition  under  iso¬ 
lated  transformations  (visualization).  Examples  are  systems  that  recognize 
frontal  views  of  faces  under  varying  illuminations  (Brunelli  and  Poggio 
1992);  recognize  simple  paper-clip-like  objects  independently  of  viewpoint 
(Poggio  and  Edelman  1990);  or  identify  simple  objects  solely  by  color  under 
spatially  varying  illumination  (Swain  and  Ballard  1990). 

Yet  in  biological  systems,  and  in  some  artificial  systems,  the  stages  may 
act  in  parallel  or  even  merge.  For  example,  there  may  be  many  short-cuts 
to  recognizing  a  frequently  encountered  object  such  as  a  face. 

Finding  the  face  might  be  streamlined  by  a  quick  search  at  low  resolu¬ 
tion  over  the  whole  image  for  face-like  patterns.  The  search  might  employ 
simplified  templates  of  a  face  containing  anthropometric  information  (for 
example,  a  two-eyes-and-mouth  mask).  Once  located,  salient  features  such 
as  eyes  can  be  used  to  demarcate  the  entire  object  to  which  they  belong, 
eliminating  the  need  to  segment  other  parts  of  the  image.  These  detec¬ 
tors  would  scan  the  image  for  the  presence  of  these  face-specific  features, 
and  using  them,  locate  the  face  for  further  processing  (translation,  scaling, 
etc.). (Some machine vision systems already implement this idea, using
translation-invariant face-feature detectors such as eye detectors [Bichsel
1991] or symmetry detectors.) Thus segmentation may occur simultaneously
with classification. The existence of these face detectors in the human
visual  system  might  explain  why  we  so  readily  perceive  faces  in  the  sim¬ 
plest  drawings  of  dots  and  lines,  or  in  symmetric  patterns  formed  in  nature 
(Hurlbert  and  Poggio  1986),  and  why  we  detect  properly  configured  faces 
more  readily  than  arbitrary  or  inverted  arrangements  of  facial  features 
(Purcell  and  Stewart  1988).  Indeed,  we  wonder  whether  face  recognition 
may  have  become  so  inveterate  that  the  human  brain  might  first  classify 
image  regions  into  face  or  nonface. 
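
A minimal sketch of the kind of low-resolution template search described above may make the idea concrete. It assumes a tiny binary image that has already been reduced to low resolution, a hypothetical 3 x 3 "two-eyes-and-mouth" mask, and a sum-of-squared-differences score; none of these choices comes from the systems cited, and the names `match_score` and `find_face` are invented for the example.

```python
def match_score(image, mask, top, left):
    """Sum of squared differences between the mask and one image patch."""
    return sum((image[top + r][left + c] - mask[r][c]) ** 2
               for r in range(len(mask))
               for c in range(len(mask[0])))

def find_face(image, mask):
    """Scan every placement of the mask and return (best score, position)."""
    rows = len(image) - len(mask) + 1
    cols = len(image[0]) - len(mask[0]) + 1
    candidates = [(match_score(image, mask, r, c), (r, c))
                  for r in range(rows) for c in range(cols)]
    return min(candidates)

# Hypothetical mask: two "eyes" in the top row, a "mouth" in the bottom row.
mask = [[1, 0, 1],
        [0, 0, 0],
        [0, 1, 0]]

# Hypothetical low-resolution image containing the pattern at row 1, column 2.
image = [[0, 0, 0, 0, 0, 0],
         [0, 0, 1, 0, 1, 0],
         [0, 0, 0, 0, 0, 0],
         [0, 0, 0, 1, 0, 0],
         [0, 0, 0, 0, 0, 0]]

best_score, best_position = find_face(image, mask)
```

Once the best position is known, the region under the mask can be handed on for translation, scaling, and finer analysis, so that the rest of the image need not be segmented.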

Recognizing  an  expected  object  might  also  be  more  speedy  and  efficient 
than  identifying  an  unexpected  one.  In  the  classification  stage,  only  those 
features  specific  for  the  expected  object  class  need  be  measured,  and  cor¬ 
rect  classification  would  not  require  that  all  features  be  simultaneously 
available.  This  step  might  therefore  be  itself  a  form  of  template  matching, 
where  part-templates  may  serve  as  well  as  whole-templates  to  locate  and 
classify  the  object.  In  many  cases  the  classification  stage  may  lead  by  itself 
to  unique  recognition,  especially  when  situational  information,  such  as  the 
expectedness  of  the  object,  restricts  the  relevant  database. 

Yet  many  questions  are  left  hanging  by  this  sketch  of  a  recognition  sys¬ 
tem.  In  biological  systems,  is  matching  done  between  primal  image  rep¬ 
resentations,  like  center-surround  outputs  at  sparse  locations,  or  between 
sets  of  higher  level  features?  Computational  experiments  on  face  recogni¬ 
tion  suggest  that  the  former  strategy  performs  much  better.  What  exactly 
are  the  key  features  used  for  identifying,  localizing  and  normalizing  an 
object  of  a  specific  class?  Is  there  an  automatic  way  to  learn  them  (Huber 
1985)?  Do  biological  visual  systems  acquire  recognition  features  through 
experience  (Edelman  1991)?  Do  humans  use  expectation  to  restrict  the 
database  for  categorization?  Some  psychophysical  experiments  suggest 
that  we  do  not  need  higher-level  expectations  to  recognize  objects  quickly 
in  a  random  series  of  images,  but  these  experiments  have  used  familiar 
objects  such  as  the  Eiffel  Tower  (M.  Potter,  personal  communication). 

A  Sketch  of  a  Memory-Based  Cortical  Architecture  for  Recognition 

We suggest that most stages in face recognition, and more generally, in object
recognition, may be implemented by modules with the same intrinsic
structure: a memory-based module (MBM). At the heart of this structure
is a set of neurons each tuned to a particular value or configuration along
one or many feature dimensions. Let us take as an example of such a
structure the hyper basis functions (HBF) network. This choice is convenient
because HBFs have already been applied successfully to several
problems in object recognition. It is also unrestrictive and easily modified,
because HBFs are closely related to other approximation and
learning techniques such as multilayer perceptrons.

RBF Networks

HBF networks are approximation schemes based on, but
more flexible than, radial basis functions (RBF) networks (see figure 8.1;
Poggio and Girosi 1990b; Poggio 1990). The fundamental equation underlying
RBF networks states that any function $f(\mathbf{x})$ may be approximated by
a weighted sum of RBFs:

$$ f(\mathbf{x}) = \sum_{i=1}^{N} c_i \, h(\|\mathbf{x} - \mathbf{t}_i\|^2) + p(\mathbf{x}) \tag{8.1} $$

The  functions  h  may  be  any  of  the  class  of  RBFs,  for  example,  Gaussians. 
p(x)  is  a  polynomial  that  is  required  by  certain  RBFs  for  the  validity  of 
the  equation.  (For  some  RBFs,  e.g.,  Gaussians,  the  addition  of  p(x)  is  not 
necessary, but improves performance of the network.) In an RBF network,
each "unit" computes the distance $\|\mathbf{x} - \mathbf{t}\|$ of the input vector x from its
center t and applies the function h to the distance value, that is, it computes
the function $h(\|\mathbf{x} - \mathbf{t}\|^2)$. The N centers t, corresponding to the N data points,
thus behave like templates, to which the inputs are compared for similarity.

A typical and illustrative choice of RBF is the Gaussian
[$h(\|\mathbf{x} - \mathbf{t}\|) = \exp(-\|\mathbf{x} - \mathbf{t}\|^2 / 2\sigma^2)$]. In the limiting case where h is a very narrow Gaussian,
the network effectively becomes a look-up table, in which a unit gives
a nonzero signal only if the input exactly matches its center t.
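
The behavior of a Gaussian RBF network, including the look-up-table limit, can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation used in the work cited: the polynomial term p(x) is set to zero, all units share one width `sigma`, and the centers and coefficients are hypothetical values chosen by hand rather than learned.

```python
import math

def squared_distance(x, t):
    """Squared Euclidean distance between input x and center t."""
    return sum((xi - ti) ** 2 for xi, ti in zip(x, t))

def gaussian_rbf(sq_dist, sigma):
    """Gaussian basis function, written as a function of squared distance."""
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def rbf_network(x, centers, weights, sigma):
    """Weighted sum of Gaussian units: equation 8.1 with p(x) = 0."""
    return sum(c * gaussian_rbf(squared_distance(x, t), sigma)
               for c, t in zip(weights, centers))

centers = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # hypothetical templates t_i
weights = [1.0, 2.0, 3.0]                        # hypothetical coefficients c_i

probe = (0.9, 0.1)  # close to, but not exactly at, the second center

# Broad tuning: the nearby unit responds strongly to the similar input.
broad = rbf_network(probe, centers, weights, sigma=0.5)

# Very narrow tuning: the same input now evokes essentially no response...
narrow = rbf_network(probe, centers, weights, sigma=0.01)

# ...and only an exact match retrieves the stored coefficient, as in a table.
exact = rbf_network((1.0, 0.0), centers, weights, sigma=0.01)
```

With the broad width the network generalizes to inputs near a stored template; with the narrow width it returns the coefficient of a center only when the input coincides with that center.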

The  simplest  recognition  scheme  based  on  RBF  networks  that  we  con¬ 
sider  is  that  suggested  by  Poggio  and  Edelman  (1990)  (see  figure  8.3)  to 
solve  the  specific  problem  of  recognizing  a  particular  3D  object  from  novel 
views,  a  subordinate-level  task.  In  the  RBF  version  of  the  network,  each 
center  stores  a  sample  view  of  the  object,  and  acts  as  a  unit  with  a  Gaussian- 
like  recognition  field  around  that  view.  The  unit  performs  an  operation  that 
could  be  described  as  "blurred"  template  matching.  At  the  output  of  the 
network  the  activities  of  the  various  units  are  combined  with  appropriate 
weights,  found  during  the  learning  stage.  An  example  of  a  recognition 
field  measured  psychophysically  for  an  asymmetric  object  after  training 
with  a  single  view  is  shown  in  figure  8.4.  As  predicted  from  the  model 
(see  Poggio  and  Edelman  1990),  the  shape  of  the  surface  of  the  recognition 
errors  is  roughly  Gaussian  and  centered  on  the  training  view. 
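
As a rough illustration of how the output weights might be found during the learning stage, the sketch below centers one Gaussian unit on each training view and solves the resulting linear system so that the network reproduces a desired label at every training view. It is a simplified variant chosen for brevity: the output is a scalar label (1 for the target object, 0 for a distractor) rather than a standard view, the "views" are hypothetical two-dimensional feature vectors, and the width `sigma` is fixed by hand. The helper names are invented for the example.

```python
import math

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def gaussian(d2, sigma):
    return math.exp(-d2 / (2.0 * sigma ** 2))

def solve_linear(A, b):
    """Solve A c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    c = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][k] * c[k] for k in range(i + 1, n))
        c[i] = (M[i][n] - s) / M[i][i]
    return c

def train(views, targets, sigma):
    """Find weights so that the network interpolates the training labels."""
    G = [[gaussian(sq_dist(v, t), sigma) for t in views] for v in views]
    return solve_linear(G, targets)

def respond(x, views, weights, sigma):
    """Network output: weighted sum of the view-tuned units."""
    return sum(c * gaussian(sq_dist(x, t), sigma)
               for c, t in zip(weights, views))

# Hypothetical feature vectors for views of the target and of a distractor.
target_views = [(0.0, 1.0), (0.5, 0.9), (1.0, 0.6)]
distractor_views = [(2.0, -1.0), (2.5, -0.5)]
views = target_views + distractor_views
targets = [1.0] * len(target_views) + [0.0] * len(distractor_views)
sigma = 0.5

weights = train(views, targets, sigma)
outputs = [respond(v, views, weights, sigma) for v in views]
max_residual = max(abs(o - y) for o, y in zip(outputs, targets))

# Response to a view that was not in the training set.
novel = respond((0.25, 0.97), views, weights, sigma)
```

Thresholding the output then gives a yes or no answer for a novel view; how well this generalizes depends on the number of stored views and on the width of the units.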

In  this  particular  model,  the  inputs  to  the  network  are  spatial  coordinates 
or  measurements  of  features  (e.g.,  angles  or  lengths  of  segments)  computed 
from  the  image.  In  general,  though,  the  inputs  to  an  RBF  network  are 
not  restricted  to  spatial  coordinates  but  could  include,  for  example,  colors 
or  configurations  of  segments,  binocular  disparities  of  features,  or  texture 
descriptions.  Certainly  in  any  biological  implementation  of  such  a  network 
the  inputs  may  include  measurements  or  descriptions  of  any  attribute  that 
the  visual  system  may  represent.  We  assume  that  in  the  primate  visual 
system  such  a  recognition  module  may  use  a  large  number  of  primitive 
measurements  as  inputs,  taken  by  different  "filters"  that  can  be  regarded  as 
many  different  "templates"  for  shape,  texture,  color,  and  so  forth.  The  only 
restriction  is  that  the  features  must  be  directly  computed  from  the  image. 
Hence the inputs are viewer-centered, not object-centered, although some,
like color, will be viewpoint-independent. The output of the network is,
though, object-centered, provided there is a sufficient number of centers.
This generality of the network permits a mix of 2D and 3D information in
the inputs, and relieves the model from the constraints of either.

Figure 8.3 (a) The HBF network proposed for the recognition of a 3D object from any of
its perspective views (Poggio and Edelman 1990). The network attempts to map any view
(as defined in the text) into a standard view, arbitrarily chosen. The norm of the difference
between the output vector f and the standard view s is thresholded to yield a 0,1 answer
(instead of the standard view the output of the network can be directly a binary classification
label). The 2N inputs accommodate the input vector v representing an arbitrary view. Each
of the n radial basis functions is initially centered on one of a subset of the M views used to
synthesize the system (n < M). During training each of the M inputs in the training set is
associated with the desired output (i.e., the standard view s). (b) A completely equivalent
interpretation of (a) for the special case of Gaussian radial basis functions. Gaussian functions
can be synthesized by multiplying the outputs of two-dimensional Gaussian receptive
fields, that "look" at the retinotopic map of the object point features. The solid circles in the
image plane represent the 2D Gaussians associated with the first radial basis function, which
represents the first view of the object. The dotted circles represent the 2D receptive fields
that synthesize the Gaussian radial function associated with another view. The 2D Gaussian
receptive fields transduce values of features, represented implicitly as activity in a retinotopic
array, and their product "computes" the radial function without the need of calculating
norms and exponentials explicitly. (From Poggio and Girosi 1990c)

Figure 8.4 The generalization field associated with a single training view. Whereas it is
easy to distinguish between, say, tubular and amoeba-like 3D objects, irrespective of their
orientation, the recognition error rate for specific objects within each of those two categories
increases sharply with misorientation relative to the familiar view. This figure shows that
the error rate for amoeba-like objects, previously seen from a single attitude, is viewpoint-dependent.
Means of error rates of six subjects and six different objects are plotted versus
rotation in depth around two orthogonal axes (Bülthoff et al. 1991; Edelman and Bülthoff
1992). The extent of rotation was ±60° in each direction; the center of the plot corresponds
to the training attitude. Shades of gray encode recognition rates, at increments of 5% (white
is better than 90%; black is 50%). (From Bülthoff and Edelman 1992). Note that viewpoint
independence can be achieved by familiarizing the subject with a sufficient number of training
views of the 3D object.
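
The factorization mentioned in the caption of figure 8.3(b) can be written out explicitly. The notation here is an assumption introduced for the illustration: the input view is taken to be the vector of image coordinates $(x_j, y_j)$ of $N$ point features, the center stores the corresponding coordinates $(t^x_j, t^y_j)$ of one view, and all receptive fields share a single width $\sigma$.

```latex
\exp\!\left(-\frac{\|\mathbf{x}-\mathbf{t}\|^{2}}{2\sigma^{2}}\right)
  = \exp\!\left(-\sum_{j=1}^{N}\frac{(x_j - t^{x}_{j})^{2} + (y_j - t^{y}_{j})^{2}}{2\sigma^{2}}\right)
  = \prod_{j=1}^{N}\exp\!\left(-\frac{(x_j - t^{x}_{j})^{2} + (y_j - t^{y}_{j})^{2}}{2\sigma^{2}}\right)
```

Each factor on the right is a 2D Gaussian receptive field centered on the stored position of one feature, so multiplying the outputs of $N$ such fields yields the $2N$-dimensional radial function without an explicit norm or exponential.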

This feature of the model also renders irrelevant the question of whether
object representations are 2D or 3D. The Poggio-Edelman model makes it
clear that 2D-based schemes can provide view invariance as readily as a 3D
model can, and compute 3D pose as well (see Poggio and Edelman 1990).
So the relevant questions are: What is explicit in neurons? And what does
it mean for information about shape to be explicit in neurons? In a sense,
some 2D-based schemes such as the Poggio-Edelman model may be considered
as plausible neurophysiological implementations of 3D models.

We  do  not  suggest  that  the  cortical  architecture  for  recognition  consists 
of  a  collection  of  such  modules,  one  for  each  recognizable  object.  Certainly 
it  is  more  complex  than  that  cartoon,  and  not  only  because  viewpoint  in¬ 
variance  is  not  the  only  problem  that  the  recognition  system  must  solve. 
For example, the cortex must also learn to recognize objects under varying
illumination (photometric invariance) and to recognize objects at the
basic as well as subordinate level. (Preliminary results on real objects
[faces] suggest that HBF modules can estimate expression and direction
of illumination as well as they estimate pose [Brunelli, personal communication;
Beymer, personal communication].) Yet each of these and other distinct
tasks  in  recognition  may  be  implemented  by  a  module  broadly  similar 
to  the  Poggio-Edelman  viewpoint-invariance  network.  We  might  expect 
that  the  system  could  be  decomposed  into  elementary  modules  similar  in 
design  but  different  in  purpose,  some  specific  for  individual  objects  (and 
therefore  solving  a  subordinate-level  task),  some  specific  to  an  object  class 
(solving  a  basic-level  task),  and  others  designed  to  perform  transforma¬ 
tions  or  feature  extractions,  for  example,  common  to  several  classes.  The 
modules  may  broadly  be  categorized  as  follows: 

•  Object-specific.  A  module  designed  to  compensate  for  specific  transfor¬ 
mations  that  a  specific  object  might  undergo.  As  in  the  Poggio-Edelman 
network,  the  module  would  consist  of  a  few  units,  each  maximally  tuned 
to  a  particular  configuration  of  the  object — or  the  face,  say,  a  particular 
combination  of  pose  and  expression.  A  more  general  form  of  the  network 
may  be  able  to  recognize  a  few  different  faces:  its  hidden  units  would  be 
tuned  to  different  views  but  of  not  just  one  face,  and  therefore  behave  more 
like  eigenfaces. 

• Class-specific. A module that generalizes across objects of a given class.
The network may be designed to extract a feature or aspect of
any of a class of objects, such as pose, color, or distance. For example, there
might be a network designed to extract the pose of a face, and a separate
network designed to extract the direction of illumination on it. Any face
fed as input to the network would elicit an estimate of its pose or illumination.

•  Task-specific.  Networks  that  solve  tasks,  such  as  shape-from-shading, 
across  classes  of  objects.  An  example  would  be  a  generic  shape-from- 
shading  network  that  takes  as  input  brightness  gradients  of  image  regions. 
It  might  act  in  the  early  stages  of  recognition,  helping  to  segment  and  clas¬ 
sify  3D  shapes  even  before  they  are  grouped  and  classified  as  objects. 

The distinctions between these types of recognition module might be
blurred if, for example, the visual system overlearns certain objects or
transformations. For instance, a shape-from-shading network might develop
for a frequently encountered type of material, or for a specific class
of  object.  Indeed,  our  working  assumption  is  that  any  apparent  differences 
between  recognition  strategies  for  different  types  of  objects  arise  not  from 
fundamental  differences  in  cortical  mechanisms  but  from  imbalances  in  the 
distribution  of  the  same  basic  modules  across  different  objects  and  different 
environments.  Savanna  Man,  like  us,  probably  had  task-specific  modules 
dedicated  to  faces,  but  although  we  might  have  shape-from-shading  mod¬ 
ules  specific  to  familiar  pieces  of  office  furniture,  he  might  not  be  able 
to  recognize  a  filing  cabinet  at  all,  much  less  under  varying  illumination. 
This  suggests  a  decomposition  into  modules  that  are  both  task-  and  object- 
specific, which is a rather unconventional but plausible idea.

Transformations specific to a particular object may also be generalized
from  transformations  learned  on  prototypes  of  the  same  class.  For  example, 
the  deformation  caused  by  a  change  in  pose  or,  for  a  face,  a  change  in 
expression  or  age,  may  be  learned  from  a  set  of  examples  of  the  same 
transformation  acting  on  prototypes  of  the  class.  Some  transformations 
may  be  generalized  across  all  objects  sharing  the  same  symmetries  (Poggio 
and  Vetter  1992). 

The  big  question  is,  if  the  recognition  system  does  consist  of  similar 
modules  performing  interlocking  tasks,  how  are  the  modules  linked,  and 
in  what  hierarchy  (if  it  makes  sense  at  all  to  talk  of  ordered  stages)?  In  con¬ 
structing  a  practical  system  for  face  recognition,  it  would  make  sense  first 
to  estimate  the  pose,  expression,  and  illumination  for  a  generic  face  and 
then  to  use  this  estimate  to  "normalize"  the  face  and  compare  it  to  single 
views  in  the  database  (additional  search  to  fine  tune  the  match  may  be  nec¬ 
essary).  Thus  the  system  would  first  employ  a  class-specific  module  based 
on  invariant  properties  of  faces  to  recover,  say,  a  generic  view — analogous 
to  an  object-centered  representation — that  could  feed  into  face-specific  net¬ 
works  for  identification.  The  information  that  the  system  extracts  in  the 
early  stages  concerning  illumination,  expression,  context,  etc.  would  not 
be  discarded.  Within  each  stage,  modules  may  be  further  decomposed  and 
arranged  in  hierarchies:  one  may  be  specific  for  eyes,  and  may  extract  gaze 
angle,  a  parameter  that  may  then  feed  into  a  module  concerned  with  the 
pose  of  the  entire  face. 

For  face  recognition,  the  generic  view  may  be  recovered  by  exploiting 
prior  information  such  as  the  approximate  bilateral  symmetry  of  faces.  In 
general  a  single  monocular  view  of  a  3D  object  (if  shading  is  neglected) 
does  not  contain  sufficient  3D  information  for  recognition  of  novel  views. 
Yet  humans  are  certainly  able  to  recognize  faces  rotated  20-30°  away  from 
frontal after training on just one frontal view. One of us has recently discussed
(Poggio 1991) different ways of solving the following problem:
from one 2D view of a 3D object generate other views, exploiting knowledge of
views of other, "prototypical" objects of the same class. It can be shown theoretically
(Poggio and Vetter 1992) that prior information on generic shape
constraints  does  reduce  the  amount  of  information  needed  to  recognize 
a  3D  object,  since  additional  virtual  views  can  be  generated  from  given 
model  views  by  the  appropriate  symmetry  transformations.  In  particular, 
for  bilaterally  symmetric  objects,  a  single  nonaccidental  "model"  view  is 
theoretically sufficient for recognition of novel views. Psychophysical experiments
(Vetter et al. 1992) confirm that humans are better at recognizing
symmetric than nonsymmetric objects.
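
The generation of a virtual view from a single model view can be checked numerically in a simple setting. The sketch assumes orthographic projection, rotation in depth about a vertical axis lying in the plane of symmetry, and known correspondence between each feature and its mirror partner; the object is a hypothetical set of 3D points. Under these assumptions, mirroring the image and exchanging symmetric partners turns the view at rotation theta into the view at rotation -theta.

```python
import math

def project(points, theta):
    """Orthographic image of 3D points after rotation by theta about the vertical axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [(x * c + z * s, y) for (x, y, z) in points]

def virtual_view(view, partner):
    """Mirror the image and exchange each feature with its symmetric partner."""
    return [(-view[partner[i]][0], view[partner[i]][1])
            for i in range(len(view))]

# Hypothetical bilaterally symmetric object: each point (x, y, z) on one
# side has a partner (-x, y, z) on the other side of the plane x = 0.
half = [(1.0, 0.0, 0.5), (0.6, 1.2, -0.3), (0.3, -0.8, 0.9)]
points = half + [(-x, y, z) for (x, y, z) in half]
n = len(half)
partner = [i + n for i in range(n)] + [i for i in range(n)]

theta = math.radians(25.0)
model = project(points, theta)          # the single model view
virtual = virtual_view(model, partner)  # view generated from it by symmetry
actual = project(points, -theta)        # true view from the mirrored pose

max_error = max(abs(vu - au) + abs(vv - av)
                for (vu, vv), (au, av) in zip(virtual, actual))
```

The virtual view coincides with the true view from the opposite rotation, so a network given one model view of a symmetric object effectively has a second training view at no extra cost.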

An  interesting  question  is  whether  there  are  indeed  multiple  routes  to 
recognition.  It  is  obvious  that  some  of  the  logically  distinct  steps  in  recog¬ 
nition  of  figure  8.2  may  be  integrated  in  fewer  modules,  depending  on  the 
specific  implementation.  Figure  8.5  shows  how  the  same  architecture  may 
appear  if  the  classification  and  the  visualization  routes  are  implemented 
with HBF networks. In this case, the database of face models would essentially
be embedded in the networks (see Poggio and Edelman 1990).

Figure 8.5 A sketch of possibly the most compact (but not the only!) implementation of the
proposed recognition architecture in terms of modules of the HBF type.

There  are  of  course  several  obvious  alternatives  to  this  architecture  and 
many  possible  refinements  and  extensions.  Even  if  oversimplified,  this 
token  architecture  is  useful  to  generate  meaningful  questions.  The  pre¬ 
ceding  discussion  may  in  fact  be  sufficient  for  performing  computational 
experiments  and  for  developing  practical  systems.  It  is  also  sufficient  for 
suggesting  psychophysical  experiments.  It  is  of  course  not  enough  from 
the  point  of  view  of  a  physiologist,  yet  the  physiological  data  in  the  next 
section  provides  broad  support  for  its  ingredients. 

Physiological  Support  for  a  Memory-Based  Recognition  Architecture 

At  least  superficially,  physiological  data  seems  to  support  the  existence 
of elements of each of these modules. Perrett et al. (1985, 1989) report
evidence from inferotemporal cortex (IT) not only for cells tuned to individual
faces but also for face cells tuned to intermediate views between frontal
and  profile,  units  that  one  would  expect  in  a  class-specific  network  de¬ 
signed  to  extract  pose  of  faces.  Such  cells  also  support  the  existence  of  the 
view-centered  units  predicted  by  the  basic  Poggio-Edelman  recognition 
module.  Young  and  Yamane  (1992)  describe  cells  in  anterior  IT  that  re¬ 
spond  optimally  to  particular  configurations  of  facial  features,  or  "physical 
prototypes."  These  may  conceivably  provide  input  to  the  cells  described 
by  Perrett  et  al.  as  "person  recognition  units,"  or  to  the  approximately 
view-independent  cells  described  by  Hasselmo  et  al.  (1989)  that  would  in 
turn  correspond  almost  exactly  to  the  object-centered  output  of  the  Poggio- 
Edelman  model.  Perrett  et  al.  (1985,  1989)  also  report  cells  that  respond 
to a given pose of the face regardless of illumination, even when the face
is  under  heavy  shadow.  Such  cells  may  resemble  units  in  a  task-specific 
network.  In  the  superior  temporal  sulcus,  Hasselmo  et  al.  (1989)  also  find 
cells  sensitive  to  head  movement  or  facial  gesture,  independent  of  the  view 
or  identity  of  the  face.  Such  cells  would  also  appear  to  be  both  class-  and 
task-specific.  (See  Perrett  and  Oram  [1992]  for  a  more  detailed  review  of 
relevant  physiological  data.) 

Fujita  and  Tanaka  (1992)  have  also  reported  cells  in  IT  that  respond 
optimally  to  certain  configurations  of  color  and  shape.  These  may  well 
represent  elements  of  networks  that  generalize  across  objects,  classifying 
them  according  to  their  geometric  and  material  constitution.  More  signif¬ 
icantly,  Fujita  and  Tanaka  (1992)  report  that  cells  in  the  anterior  region  of 
IT  (cytoarchitectonic  area  TE)  are  arranged  in  columns,  within  which  cells 
respond  to  similar  configurations  of  color,  shape,  and  texture.  Each  con¬ 
figuration  may  be  thought  of  as  a  template,  which  in  turn  might  encode  an 
entire  object  (e.g.,  a  face)  or  a  part  of  an  object  (e.g.,  the  lips).  Within  one 
column,  cells  may  respond  to  slightly  different  versions  of  the  template, 
obtained  by  rotations  in  the  image-plane,  for  example.  Fujita  and  Tanaka 
(1992)  conclude  that  each  of  the  2000  or  so  columns  in  TE  may  represent 
one  phoneme  in  the  language  of  objects,  and  that  combinations  of  activity 
across  the  columns  are  sufficient  to  encode  all  recognizable  objects. 

The  existence  of  such  columns  supports  the  notion  that  the  visual  sys¬ 
tem  may  achieve  invariance  to  image-plane  transformations  of  elementary 
features  by  replicating  the  necessary  feature  measurements  at  different  po¬ 
sitions,  at  different  scales  and  with  different  rotations. 

In  the  next  section  we  describe  how  key  aspects  of  the  architecture  could 
be  implemented  in  terms  of  plausible  biophysical  mechanisms  and  neuro¬ 
physiological  circuitries. 

NEURAL  MODELING  OF  MEMORY-BASED  ARCHITECTURES  FOR 
RECOGNITION 

In  this  section  we  discuss  in  more  detail  the  possible  neural  implemen¬ 
tations  of  a  recognition  system  built  from  MBMs.  The  main  questions 
we address are: How are MBMs constructed when a new object or class
of objects is learned? and How might MBM units be constructed from
known biophysical mechanisms? We propose that there are two stages
of learning, supervised and unsupervised, and illustrate to which
elements of a memory-based network they correspond. Where could they
be  localized  in  terms  of  cortical  structures?  What  mechanisms  could  be 
responsible?  We  discuss  the  memory-based  module  itself  and  the  circuitry 
that  might  underlie  it. 

The Learning-from-Examples Module

The  simple  RBF  version  of  an  MBM,  discussed  earlier  in  this  chapter,  learns 
to  recognize  an  object  in  a  straightforward  way.  Its  centers  are  fixed,  chosen 
as  a  subset  of  the  training  examples.  The  only  parameters  that  can  be 
modified  as  the  network  learns  to  associate  each  view  with  the  correct 
response ("yes" or "no" to the target object) are the coefficients $c_\alpha$, the
weights  on  the  connections  from  each  center  to  the  output. 

The  full  HBF  network  permits  learning  mechanisms  that  are  more  bi¬ 
ologically  plausible  by  allowing  more  parameters  to  be  modified.  HBF 
networks  are  equivalent  to  the  following  scheme  for  approximating  a  mul¬ 
tivariate  function: 

\[
f^*(x) = \sum_{\alpha=1}^{n} c_\alpha \, G\left(\|x - t_\alpha\|_W^2\right) + p(x) \qquad (8.2)
\]

where the centers $t_\alpha$ and coefficients $c_\alpha$ are unknown, and are in general
fewer in number than the data points (n < N). The norm is a weighted norm

\[
\|x - t_\alpha\|_W^2 = (x - t_\alpha)^T W^T W (x - t_\alpha) \qquad (8.3)
\]

where W is an unknown square matrix and the superscript T indicates
the transpose. In the simple case of diagonal W the diagonal elements
$w_i$ assign a specific weight to each input coordinate, determining in fact
the units of measure and the importance of each feature (the matrix W is
especially important in cases in which the input features are of a different
type and their relative importance is unknown) (Poggio and Girosi 1990a).
During learning, not only the coefficients $c_\alpha$ but also the centers $t_\alpha$ and the
elements of W are updated by instruction on the input-output examples
(figure 8.6).
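As a minimal sketch of equations (8.2) and (8.3), the following Python fragment evaluates an HBF network for one input. It assumes a Gaussian basis function G(r) = exp(-r) applied to the squared weighted norm and omits the polynomial term p(x); the function names and the toy centers, coefficients, and matrix W are hypothetical and chosen only for illustration.

```python
import math

def weighted_sq_norm(x, t, W):
    """Return (x - t)^T W^T W (x - t) for a square matrix W given as a list of rows."""
    d = [xi - ti for xi, ti in zip(x, t)]
    Wd = [sum(w_ij * d_j for w_ij, d_j in zip(row, d)) for row in W]
    return sum(v * v for v in Wd)

def hbf_output(x, centers, coeffs, W):
    """Evaluate sum over centers of c * G(||x - t||_W^2) with G(r) = exp(-r).

    The polynomial term p(x) of equation (8.2) is left out of this sketch.
    """
    return sum(c * math.exp(-weighted_sq_norm(x, t, W))
               for c, t in zip(coeffs, centers))

# Hypothetical toy network: two centers in a 2D input space, diagonal W.
centers = [[0.0, 0.0], [1.0, 1.0]]
coeffs = [1.0, -0.5]
W = [[2.0, 0.0], [0.0, 0.5]]

# Response when the input coincides with the first center.
at_first_center = hbf_output([0.0, 0.0], centers, coeffs, W)
```

With a diagonal W the first input coordinate here counts four times more heavily than it would under the identity metric, and the second counts four times less, which is the sense in which W sets the units of measure of each feature.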

Whereas the RBF technique is similar to template matching and shares its
limitations, HBF networks perform a generalization of template matching in
an appropriately linearly transformed space, with the appropriate metric.
HBF  networks  are  therefore  different  in  both  interpretation  and  capabilities 
from  "vanilla"  RBF.  An  RBF  network  can  recognize  an  object  rotated  to 
novel  orientations  only  if  it  has  centers  corresponding  to  sample  rotations 
of  the  object.  HBFs,  though,  can  perform  a  variety  of  more  sophisticated 
recognition  tasks.  For  example,  HBFs  can 

1. Discover the Basri-Ullman result (Basri and Ullman 1989; Brunelli and
Poggio, unpublished). (In its strong form, this result states that under
orthographic projection any view of the visible features of the 3D object
may be generated by a linear combination of two other views.)

2. With a nondiagonal W, recognize an object under orthographic
projection with only one center.

3. Provide invariance (or near invariance under perspective projection)
for scale, rotation, and other uniform deformations in the image plane,
without requiring that the features be invariant.

4. Discover symmetry, collinearity, and other "linear-class" properties (see
Poggio and Vetter 1992).

Figure 8.6 A network of the hyper basis functions type. For object recognition the inputs
could be image measurements such as values of different filters at each of a number of
locations in the image. The network is a natural extension of the template matching scheme
and contains it as a special case. The dotted lines correspond to linear and constant terms
in the expansion. The output unit may contain a sigmoidal transformation of the sum of its
inputs (see Poggio and Girosi 1990b).
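The linear-combination property named in item 1 can be checked numerically. The sketch below does not show how an HBF network would discover it; it only verifies the underlying geometry under stated assumptions: a rigid object rotated about the origin, orthographic projection, no occlusion, and no noise. The point set, the rotation angles, and all function names are hypothetical. The image coordinates of a novel view are written as linear combinations of three coordinate vectors taken from two model views (x and y from the first, x from the second).

```python
import math

def rot_x(a):
    """Rotation matrix about the x axis by angle a (radians)."""
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def rot_y(a):
    """Rotation matrix about the y axis by angle a (radians)."""
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def project(R, points):
    """Rotate each 3D point by R, then drop z (orthographic projection)."""
    xs = [sum(R[0][k] * p[k] for k in range(3)) for p in points]
    ys = [sum(R[1][k] * p[k] for k in range(3)) for p in points]
    return xs, ys

def det3(M):
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

def solve3(A, b):
    """Solve the 3x3 system A z = b by Cramer's rule."""
    d = det3(A)
    return [det3([[b[i] if k == j else A[i][k] for k in range(3)]
                  for i in range(3)]) / d for j in range(3)]

# Hypothetical object: six feature points in 3D, all visible in every view.
points = [(1.0, 0.0, 0.5), (0.0, 1.0, -0.3), (0.2, 0.4, 1.0),
          (-0.7, 0.3, 0.6), (0.5, -0.9, -0.2), (-0.4, -0.6, 0.8)]

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
x1, y1 = project(identity, points)      # first model view
x2, _ = project(rot_y(0.5), points)     # second model view
x_new, y_new = project(matmul(rot_x(0.3), rot_y(-0.8)), points)  # novel view

# Fit the mixing coefficients on the first three points only ...
basis = list(zip(x1, y1, x2))
A = [list(row) for row in basis[:3]]
coef_x = solve3(A, x_new[:3])
coef_y = solve3(A, y_new[:3])

# ... then check that they also predict the remaining points of the novel view.
def max_error(coef, target):
    return max(abs(sum(c * f for c, f in zip(coef, row)) - v)
               for row, v in zip(basis, target))

max_err_x = max_error(coef_x, x_new)
max_err_y = max_error(coef_y, y_new)
```

The coefficients are fitted on three points and tested on all six, so a vanishing error on the held-out points reflects the linear structure of the views rather than the fit itself.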

Gaussian Radial Basis Functions In the special case where the network
basis functions are Gaussian and the matrix W diagonal, its elements $w_i$
have an appealingly obvious interpretation. A multidimensional Gaussian
basis function is the product of one-dimensional Gaussians and the scale
of each is given by the inverse of $w_i$. For example, a 2D Gaussian radial
function centered on t can be written as

\[
G\left(\|x - t\|_W^2\right) = e^{-\|x - t\|_W^2} = e^{-(x - t_x)^2/\sigma_x^2} \, e^{-(y - t_y)^2/\sigma_y^2}, \qquad (8.4)
\]

where $\sigma_x = 1/w_1$ and $\sigma_y = 1/w_2$, and $w_1$ and $w_2$ are the elements of the
diagonal matrix W.

Thus  a  multidimensional  center  can  be  factored  in  terms  of  one-dimensional 
centers.  Each  one-dimensional  center  is  individually  tuned  to  its  input: 
centers with small $w_i$, or large $\sigma_i$, are less selective and will give
appreciable responses to a range of values of the input feature; centers with large
$w_i$, or small $\sigma_i$, are more selective for their input and, accordingly, have
greater  influence  on  the  response  of  the  multidimensional  center.  The 
template  represented  by  the  multidimensional  center  can  be  considered 
as  a  conjunction  of  one-dimensional  templates.  In  this  sense,  a  Gaussian 
HBF  network  performs  the  disjunction  of  conjunctions:  the  conjunctions 
represented  by  the  multidimensional  centers  are  "or"ed  in  the  weighted 
sum  of  center  activities  that  forms  the  output  of  the  network. 
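The factorization described above can be illustrated with a short Python sketch. It assumes a diagonal W with elements w_i, scales sigma_i = 1/w_i, and the Gaussian form exp(-r) of the squared weighted norm; the function names and numerical values are hypothetical. The response of the two-dimensional center equals the product of two one-dimensional Gaussian responses, and the dimension with the larger w_i is the more sharply tuned.

```python
import math

def gaussian_center(x, t, w):
    """Response exp(-sum_i w_i^2 (x_i - t_i)^2) of a Gaussian unit with diagonal W."""
    return math.exp(-sum((wi * (xi - ti)) ** 2 for xi, ti, wi in zip(x, t, w)))

def one_dim_gaussian(xi, ti, sigma):
    """One-dimensional Gaussian tuning curve with scale sigma = 1 / w."""
    return math.exp(-((xi - ti) / sigma) ** 2)

t = [0.0, 0.0]                     # hypothetical 2D center
w = [0.5, 4.0]                     # broad tuning on axis 0, sharp tuning on axis 1
sigmas = [1.0 / wi for wi in w]
x = [0.3, 0.3]                     # input displaced equally along both axes

joint = gaussian_center(x, t, w)
broad = one_dim_gaussian(x[0], t[0], sigmas[0])
sharp = one_dim_gaussian(x[1], t[1], sigmas[1])
product = broad * sharp            # conjunction of the one-dimensional templates
```

For the same displacement the sharply tuned dimension falls off much faster than the broadly tuned one, so it dominates the response of the multidimensional center.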

Expected  Physiological  Properties  of  MBM  Units 

The  Neurophysiological  Interpretation  of  HBF  Centers  Our  key  claim 
is  that  HBF  centers  and  tuned  cortical  neurons  behave  alike. 

A  Gaussian  HBF  unit  is  maximally  excited  when  each  component  of 
the  input  exactly  matches  each  component  of  the  center.  Thus  the  unit  is 
optimally  tuned  to  the  stimulus  value  specified  by  its  center.  Units  with 
multidimensional  centers  are  tuned  to  complex  features,  formed  by  the 
conjunction  of  simpler  features,  as  described  in  the  previous  section. 

This description closely resembles the customary description of cortical cells
optimally  tuned  to  a  more  or  less  complex  stimulus.  So-called  place  coding 
is  the  simplest  and  most  universal  example  of  tuning:  cells  with  roughly 
Gaussian  receptive  fields  have  peak  sensitivities  to  given  locations  in  the 
input  space;  by  overlapping,  the  cell  sensitivities  cover  all  of  that  space.  In 
V1 the input space may be up to five dimensional, depending on whether
the  cell  is  tuned  not  only  to  the  retinal  coordinates  x,y  but  also  to  stim¬ 
ulus  orientation,  motion  direction,  and  binocular  disparity.  In  V4  some 
cells  respond  optimally  to  a  stimulus  combining  the  appropriate  values  of 
speed  and  color  (N.  K.  Logothetis,  personal  communication;  Logothetis 
and  Charles  1990).  Other  V4  cells  respond  optimally  to  a  combination  of 
color and shape (D. Van Essen, personal communication). In MST there exist
cells optimally tuned to specific motions in different parts of the receptive
field and therefore to different motion "dimensions." Most of these cells
are  also  selective  for  stimulus  contrast.  In  "later"  areas  such  as  IT  cells  may 
be  tuned  to  more  complex  stimuli  which  can  be  changed  in  a  number  of 
"dimensions"  (Desimone  et  al.  1984).  Gross  (1992)  concludes  that  "IT  cells 
tend  to  respond  at  different  rates  to  a  variety  of  different  stimuli."  Thus  it 
seems  that  multidimensional  units  with  Gaussian-like  tuning  are  not  only 
biologically  plausible,  but  ubiquitous  in  cortical  physiology.  This  claim 
is not meant to imply that for every feature dimension of a
multidimensionally tuned neuron, neurons feeding into it can be found individually
tuned to that dimension. For example, for some motion-selective cells in
MT the selectivities to spatial frequency and temporal frequency cannot
be  separated.  Yet  for  these,  it  may  be  inappropriate  to  consider  time  and 
space  as  two  independent  dimensions  and  more  appropriate  to  consider 
velocity  as  the  single  dimension  in  which  the  neuron  is  tuned.  On  the 
other  hand,  it  is  well  known  that  at  lower  levels  in  the  visual  system  there 
do exist cells broadly tuned individually to spatial frequency, orientation,
and  wavelength,  for  example,  and  from  these  dimensions  many  complex 
features  can  be  constructed. 

We  also  observe  that  not  all  MBMs  have  the  same  applicability  in  de¬ 
scribing  properties  of  cortical  neurons.  In  particular,  tuned  neurons  seem 
to  behave  more  like  Gaussian  HBF  units  than  like  the  sigmoidal  units  typi¬ 
cally  found  in  multilayer  perceptrons  (MLPs):  the  tuned  response  function 
of cortical neurons resembles $\exp(-\|x - t\|^2/2\sigma^2)$ more than it does $E(x \cdot w)$,
where E is a sigmoidal "squashing" function and we define w as the
vector of connection weights including the bias parameter $\theta$. (The typical
sigmoidal  response  to  contrast  that  most  neurons  display  may  be  treated 
as  a  Gaussian  of  large  variance.)  For  example,  when  the  stimulus  to  an 
orientation-selective  cortical  neuron  is  changed  from  its  optimal  value  in 
any  direction,  the  neuron's  response  typically  decreases.  The  activity  of 
a  Gaussian  HBF  unit  would  also  decline  with  any  change  in  the  stimulus 
away  from  its  optimal  value  t.  But  for  the  sigmoid  unit  certain  changes 
away  from  the  optimal  stimulus  will  not  decrease  its  activity,  for  example 
when  the  input  x  is  multiplied  by  a  constant  a  >  1. 
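The contrast between the two unit types can be made concrete with a small Python sketch. It assumes a logistic function as the sigmoid, omits the bias term, and uses weights for which x · w is positive at the reference stimulus; all names and values are hypothetical. Scaling the optimal stimulus by a constant a > 1 lowers the response of the Gaussian unit but raises that of the sigmoidal unit.

```python
import math

def gaussian_unit(x, t, sigma=1.0):
    """Gaussian tuning: exp(-||x - t||^2 / (2 sigma^2))."""
    r2 = sum((xi - ti) ** 2 for xi, ti in zip(x, t))
    return math.exp(-r2 / (2.0 * sigma ** 2))

def sigmoid_unit(x, w):
    """Logistic squashing of the projection x . w (bias omitted in this sketch)."""
    return 1.0 / (1.0 + math.exp(-sum(xi * wi for xi, wi in zip(x, w))))

t = [1.0, 2.0]                    # preferred stimulus of the Gaussian unit
w = [0.5, 0.25]                   # hypothetical weights of the sigmoidal unit
a = 1.5                           # scaling constant, a > 1
scaled = [a * v for v in t]

gauss_at_opt, gauss_scaled = gaussian_unit(t, t), gaussian_unit(scaled, t)
sig_at_opt, sig_scaled = sigmoid_unit(t, w), sigmoid_unit(scaled, w)
```

The Gaussian unit has a single preferred point in the input space, whereas the sigmoidal unit prefers a direction, so moving further along that direction can only increase its output.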

Lastly,  we  observe  that  although  the  Gaussian  is  the  simplest  and  most 
readily  interpretable  RBF  in  physiological  terms,  it  might  not  ultimately 
provide the best fit to all the physiological data once it is all in. In espousing the
general theory of MBMs for cortical mechanisms of object recognition, we
do not regard Gaussian RBFs as the only model of cortical
neurons, only as the most plausible at present.

Centers  and  a  Fundamental  Property  of  Our  Sensory  World  We  can 
recognize  almost  any  object  from  any  of  many  small  subsets  of  its  features, 
visual  and  nonvisual.  We  can  perform  many  motor  actions  in  several 
different  ways.  In  most  situations,  our  sensory  and  motor  worlds  are 
redundant. In the language of the previous section this means that instead
of  high-dimensional  centers  any  of  several  lower  dimensional  centers  are  often 
sufficient  to  perform  a  given  task.  This  means  that  the  "and"  of  a  high¬ 
dimensional  conjunction  can  be  replaced  by  the  "or"  of  its  components — a 
face  may  be  recognized  by  its  eyebrows  alone,  or  a  mug  by  its  color.  To 
recognize  an  object,  we  may  use  not  only  templates  comprising  all  its 
features,  but  also  subtemplates,  comprising  subsets  of  features.  This  is 
similar  in  spirit  to  the  use  of  several  small  templates  as  well  as  a  whole-face 
template  in  the  Brunelli-Poggio  work  on  frontal  face  recognition  (Brunelli 
and  Poggio  1992). 

Splitting  the  recognizable  world  into  its  additive  parts  may  well  be 
preferable to reconstructing it in its full multidimensionality, because a
system composed of several independently accessible parts is inherently
more  robust  than  a  whole,  simultaneously  dependent  on  each  of  its  parts. 
The  small  loss  in  uniqueness  of  recognition  is  easily  offset  by  the  gain 
against  noise  and  occlusion.  This  reduction  of  the  recognizable  world  into 
its  parts  may  well  be  what  allows  us  to  "understand"  the  things  that  we 
see  (see  appendix  B). 
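The robustness argument can be sketched in Python under simplifying assumptions: a hypothetical four-feature template, two hypothetical two-feature subtemplates, Gaussian tuning of fixed width, and a maximum over subtemplate responses standing in for the "or" (the text above realizes the "or" as a weighted sum). When half of the features are corrupted, as under occlusion, the full conjunction fails while one of the subtemplates still responds fully.

```python
import math

def gaussian(x, t, sigma=0.5):
    """Gaussian match between a feature vector x and a template t."""
    r2 = sum((xi - ti) ** 2 for xi, ti in zip(x, t))
    return math.exp(-r2 / (2.0 * sigma ** 2))

template = [1.0, 0.5, -0.3, 0.8]   # hypothetical full template (four features)
subsets = [(0, 1), (2, 3)]         # hypothetical subtemplates (feature indices)

def full_response(x):
    """High-dimensional conjunction: every feature must match."""
    return gaussian(x, template)

def or_of_parts(x):
    """'Or' of lower-dimensional conjunctions, taken here as a maximum."""
    return max(gaussian([x[i] for i in idx], [template[i] for i in idx])
               for idx in subsets)

clean = list(template)
occluded = [1.0, 0.5, 3.0, -2.0]   # last two features corrupted

full_clean, full_occluded = full_response(clean), full_response(occluded)
parts_clean, parts_occluded = or_of_parts(clean), or_of_parts(occluded)
```

The price of this robustness is the loss of uniqueness noted above: any object sharing the first two features would excite the same subtemplate.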

How  Many  Cells?  The  idea  of  sparse  population  coding  is  consistent  with 
much  physiological  evidence,  beginning  even  at  the  retinal  level  where 
colors  are  coded  by  three  types  of  photoreceptors.  Young  and  Yamane 
(1992)  conclude  from  neurophysiological  recordings  of  IT  cells  broadly 
tuned  to  physical  prototypes  of  faces:  "Rather  than  representing  each  cell 
as  a  vector  in  the  space,  the  cell  could  be  represented  as  a  surface  raised 
above  the  feature  space.  The  height  of  the  surface  above  each  point  in 
the  feature  space  would  be  given  by  the  response  magnitude  of  the  cell 
to  the  corresponding  stimuli  and  population  vectors  would  be  derived  by 
summing  the  response  weighted  surfaces  for  each  cell  for  each  stimulus." 
MBMs  also  suggest  that  the  importance  of  the  object  and  the  exposure  to 
it  may  determine  how  many  centers  are  devoted  to  its  recognition.  Thus 
faces  may  have  a  more  "punctate"  representation  than  other  objects  simply 
because  more  centers  are  used.  Psychophysical  experiments  do  suggest 
that  an  increasing  number  of  centers  is  created  under  extended  training  to 
recognize  a  3D  object  (Bulthoff  and  Edelman  1992). 

While  we  would  not  dare  to  make  a  specific  prediction  on  the  absolute 
number  of  cells  used  to  code  for  a  specific  object,  computational  experi¬ 
ments  and  our  arguments  here  suggest  at  least  a  minimum  bound.  Sim¬ 
ulations  by  Poggio  and  Edelman  (1990)  suggest  that  in  an  MBM  model  a 
minimum  of  10-100  units  is  needed  to  represent  all  possible  views  of  a  3D 
object.  We  think  that  the  primate  visual  system  could  not  achieve  the  same 
representation  with  fewer  than  on  the  order  of  1000.  This  number  seems 
physiologically  plausible,  although  we  expect  that  the  actual  number  will 
depend  strongly  on  the  reliability  of  the  neurons,  training  of  the  animal, 
relevance  of  the  represented  object  and  other  properties  of  the  implemen¬ 
tation.  Thus  we  envisage  that  training  a  monkey  to  one  view  of  a  target 
object  may  "create"  at  least  on  the  order  of  100  cells  tuned  to  that  view 
in  the  relevant  cortical  area,  with  a  generalization  field  similar  to  the  one 
shown  in  figure  8.4.  Training  to  an  additional  view  may  create  or  recruit 
cells  tuned  to  that  view.  Overtraining  a  monkey  on  a  specific  object  should 
result  in  an  overrepresentation  in  cortex  of  that  object — more  cells  than 
normally  expected  would  be  tuned  to  views  of  the  object.  Recent  results 
from  Kobatake  et  al.  (1993)  suggest  that  up  to  two  orders  of  magnitude 
more  cells  may  be  "created"  in  IT  (or,  rather,  the  stimulus  selectivities  of 
existing  cells  altered)  on  overtraining  to  specific  objects. 

Note that we do not mean to imply that only 10-1000 cortical cells would
be  active  on  presentation  of  an  object.  Many  more  would  be  activated  than 
those that are critical for its representation. We suggest only that the activity
of approximately 100 cells should be sufficient to discriminate between two
distinct objects. This suggestion is broadly supported by the conclusion of
Young and Yamane (1992) that the population response of approximately
40 cells in IT is roughly sufficient to encode a particular face, and
by  the  related  observation  of  Britten  et  al.  (1992)  that  the  activity  of  a  small 
pool  of  weakly  correlated  neurons  in  MT  is  sufficient  to  predict  a  monkey's 
behavioral  response  in  a  motion  detection  task. 


HBF  Centers  and  Biophysical  Mechanisms  How  might  multidimen¬ 
sional  Gaussian  receptive  fields  be  synthesized  from  known  receptive  fields 
and  biophysical  mechanisms? 

The  simplest  answer  is  that  cells  tuned  to  complex  features  are  con¬ 
structed  from  a  hierarchy  of  simpler  cells  tuned  to  incrementally  larger 
conjunctions  of  elementary  features.  This  idea — a  standard  explanation — 
can  immediately  be  formalized  in  terms  of  Gaussian  radial  basis  functions, 
since  a  multidimensional  Gaussian  function  can  be  decomposed  into  the 
product  of  lower  dimensional  Gaussians  (Marr  and  Poggio  1977;  Ballard 
1986;  Mel  1988;  Poggio  and  Girosi  1990b). 

The  scheme  of  figure  8.7  is  a  possible  example  of  an  implementation 
of  Gaussian  radial  basis  functions  in  terms  of  physiologically  plausible 
mechanisms.  The  first  step  applies  to  situations  in  which  the  inputs  are 
place-coded,  that  is,  in  which  the  value  of  the  input  is  represented  by  its 
location  in  a  spatial  array  of  cells — as,  for  example,  the  image  coordinates 
x, y are encoded by the spatial pattern of photoreceptor activities. In this
case  Gaussian  radial  functions  in  one,  two,  and  possibly  three  dimensions 
can  be  implemented  as  receptive  fields  by  weighted  connections  from  the 
sensor  arrays  (or  some  retinotopic  array  of  units  whose  activity  encodes 
the  location  of  features).  If  the  inputs  are  interval-coded,  that  is,  if  the 
input  value  is  represented  by  the  continuously  varying  firing  rate  of  a 
single  neuron,  then  a  one-dimensional  Gaussian-like  tuned  cell  can  be 
created  by  passing  the  input  value  through  multiple  sigmoidal  functions 
with  different  thresholds  and  taking  their  difference. 
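The interval-coded case can be sketched as follows, assuming logistic sigmoids, an input firing rate normalized to the range 0 to 1, and hypothetical thresholds and gain. The difference of two sigmoids with different thresholds is small for low and high rates and peaks in between, giving a Gaussian-like tuning curve.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def tuned_response(rate, low=0.3, high=0.7, gain=20.0):
    """Difference of two sigmoids with thresholds low < high.

    The result is a bump that peaks midway between the two thresholds.
    """
    return sigmoid(gain * (rate - low)) - sigmoid(gain * (rate - high))

# Sample the tuning curve at a low, an intermediate, and a high input rate.
responses = {r: tuned_response(r) for r in (0.0, 0.5, 1.0)}
```

The separation of the two thresholds sets the width of the bump and their midpoint sets the preferred input value, so a bank of such units with different threshold pairs would tile the input range.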

Consider,  for  example,  the  problem  of  encoding  color.  At  the  retinal 
level,  color  is  recorded  by  the  triplet  of  activities  of  three  types  of  cell:  the 
cone-opponent  red-green  (R-G)  and  blue-yellow  (B-Y)  cells  and  the  lumi¬ 
nance  (L)  cell.  An  R-G  cell  signals  increasing  amounts  of  red  or  decreasing 
amounts  of  green  by  increasing  its  firing  rate.  Thus  it  does  not  behave  like 
a  Gaussian  tuned  cell.  But  at  higher  levels  in  the  visual  system,  there  exist 
cells  that  behave  very  much  like  units  tuned  to  particular  values  in  3D 
color  space  (Schein  and  Desimone  1990).  How  are  these  multidimensional 
tuned  color  cells  constructed  from  one-dimensional  rate-coded  cells?  We 
suggest  that  one-dimensional  Gaussian  tuned  cells  may  be  created  by  the 
above  mechanism,  selective  to  restricted  ranges  of  the  three  color  axes. 

Gaussians in higher dimensions can then be synthesized as products
of one- and two-dimensional receptive fields. An important feature of
this scheme is that the multidimensional radial functions are synthesized
directly by appropriately weighted connections from the sensor arrays,
without any need of an explicit computation of the norm and the
exponential. From this perspective the computation is performed by
Gaussian receptive fields and their combination (through some approximation
to multiplication), rather than by threshold functions. The view is in the
spirit of the key role that the concept of receptive field has always played
in neurophysiology. It predicts a sparse population coding in terms of
low-dimensional feature-like cells and multidimensional Gaussian-like
receptive fields, somewhat similar to template-like cells, a prediction that could
be tested experimentally on cortical cells.

Figure 8.7 A three-dimensional radial Gaussian implemented by multiplying a
two-dimensional and a one-dimensional Gaussian receptive field. The latter two functions are
synthesized directly by appropriately weighted connections from the sensor arrays, as neural
receptive fields are usually thought to arise. Notice that they transduce the implicit position
of stimuli in the sensor array into a number (the activity of the unit). They thus serve the
dual purpose of providing the required "number" representation from the activity of the
sensor array and of computing a Gaussian function. 2D Gaussians acting on a retinotopic
map can be regarded as representing 2D "features," while the radial basis function represents
the "template" resulting from the conjunction of those lower-dimensional features. (From
Poggio and Girosi 1989a)

The multiplication operation required by the previous interpretation of
Gaussian  RBFs  to  perform  the  "conjunction"  of  Gaussian  receptive  fields  is 
not  too  implausible  from  a  biophysical  point  of  view.  It  could  be  performed 
by  several  biophysical  mechanisms  (see  Koch  and  Poggio  1987;  Poggio 
1990).  Here  we  mention  several  possibilities: 

1.  Inhibition  of  the  silent  type  and  related  synaptic  and  dendritic  circuitry 
(see  Poggio  and  Torre  1978;  Torre  and  Poggio  1978). 

2.  The  AND-like  mechanism  of  NMDA  receptors. 

3.  A  logarithmic  transformation,  followed  by  summation,  followed  by  ex¬ 
ponentiation.  The  logarithmic  and  exponential  characteristic  could  be  im¬ 
plemented  in  appropriate  ranges  by  the  sigmoid-like  pre-  to  postsynaptic 
voltage  transduction  of  many  synapses. 

4.  Approximation  of  the  multiplication  by  summation  and  thresholding 
as  suggested  by  Mel  (1990). 
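The third mechanism in the list above rests on a simple identity, sketched here in Python. The sketch assumes strictly positive activities, since the logarithm is undefined otherwise, and uses hypothetical response values; it says nothing about how closely a synapse could approximate the logarithm or the exponential.

```python
import math

def multiply_via_logs(a, b):
    """Product of two positive activities computed as exp(log a + log b)."""
    return math.exp(math.log(a) + math.log(b))

g1 = math.exp(-0.4)   # hypothetical response of a 2D Gaussian receptive field
g2 = math.exp(-0.1)   # hypothetical response of a 1D Gaussian receptive field

# Conjunction of the two receptive fields via log, sum, exp.
conjunction = multiply_via_logs(g1, g2)
```

For Gaussian inputs the logarithm of each response is just the negative of its quadratic exponent, so summing the logarithms adds the exponents and the final exponentiation yields the higher-dimensional Gaussian.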

If  the  first  or  second  mechanism  is  used,  the  product  of  figure  8.7  can 
be  performed  directly  on  the  dendritic  tree  of  the  neuron  representing  the 
corresponding  radial  function.  In  the  case  of  Gaussian  receptive  fields 
used  to  synthesize  Gaussian  radial  basis  functions,  the  center  vector  is 
effectively stored in the position of the 2D (or 1D) receptive fields and in
their  connections  to  the  product  unit(s).  This  is  plausible  physiologically. 

Linear  terms  (direct  connections  from  the  inputs  to  the  output)  can  be 
realized  directly  as  inputs  to  an  output  neuron  that  summates  linearly  its 
synaptic  inputs.  An  output  nonlinearity  such  as  a  threshold  or  a  sigmoid 
or  a  log  transformation  may  be  advantageous  for  many  tasks  and  will  not 
change  the  basic  form  of  the  model  (see  Poggio  and  Girosi  1989). 

Circuits There is at least one other way to implement HBF networks
in terms of known properties of neurons. It exploits the equivalence of
HBFs with MLP networks for normalized inputs (Maruyama et al. 1991).
If the inputs are normalized (as usual for unitary input representations), an
HBF network could be implemented as an MLP network by using
threshold units. There is a problem, though, in normalizing the inputs in a
biologically plausible way. MLP networks have a straightforward
implementation in terms of linear excitation and inhibition and of the threshold
mechanism of the spike for the sigmoidal nonlinearity. The latter could
also be implemented in terms of the relationship between presynaptic
and postsynaptic voltage. In either case this
implementation requires one neuron per sigmoidal unit in the network.
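The reason normalization matters can be seen from a one-line identity, checked numerically in the sketch below. It assumes that both the input x and the center t have unit Euclidean norm and that W is the identity; the vectors are hypothetical. Under these assumptions the squared distance ||x - t||^2 equals 2 - 2 x · t, so a radial function of the distance becomes a function of the projection x · t alone, which is the quantity an MLP unit computes before its nonlinearity.

```python
import math

def normalize(v):
    """Scale v to unit Euclidean norm."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def sq_dist(x, t):
    return sum((a - b) ** 2 for a, b in zip(x, t))

def dot(x, t):
    return sum(a * b for a, b in zip(x, t))

x = normalize([0.2, -1.0, 0.7])    # hypothetical normalized input
t = normalize([0.5, 0.4, -0.3])    # hypothetical normalized center

lhs = sq_dist(x, t)
rhs = 2.0 - 2.0 * dot(x, t)

# The same unit written radially and as a function of the projection x . t.
radial = math.exp(-lhs)
ridge = math.exp(-2.0) * math.exp(2.0 * dot(x, t))
```

The resulting function of x · t is an exponential rather than a sigmoid, so this sketch shows only why the two families coincide in form for normalized inputs, not the full equivalence cited above.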

Mel  (1992)  has  simulated  a  specific  biophysical  hypothesis  about  the  role 
of  cortical  pyramidal  cells  in  implementing  a  learning  scheme  that  is  very 
similar to an HBF network. Marr (1970) had proposed another similar model
of how pyramidal cells in neocortex could learn to discriminate different
patterns. Marr's model is, in a sense, the look-up table limit of our HBF
model.

Mechanisms for Learning


Reasoning  from  the  HBF  model  we  expect  two  mechanisms  for  learning, 
probably  with  different  localizations,  one  that  could  occur  unsupervised 
and  thus  is  similar  to  adaptation,  and  one  supervised  and  probably  based 
on  Hebb-like  mechanisms. 

The  first  stage  of  learning  would  occur  at  the  site  of  the  centers.  Let  us 
remember  that  a  center  represents  a  neuron  tuned  to  a  particular  visual 
stimulus, for example, a vertically oriented light bar. The coefficients $c_\alpha$
represent  the  synaptic  weights  on  the  connections  that  the  neuron  makes 
to  the  output  neuron  that  registers  the  network's  response.  In  the  simple 
RBF  scheme  the  only  parameters  updated  by  learning  are  these  coefficients. 
But  in  constructing  the  network,  the  centers  must  be  set  to  values  equal  to 
the input examples. Physiologically, then, selecting the centers $t_\alpha$ might
correspond  to  choosing  or  retuning  a  subset  of  neurons  selectively  respon¬ 
sive  to  the  range  of  stimulus  attributes  encountered  in  the  task.  This  stage 
would  be  very  much  like  adaptation,  an  adjustment  to  the  prevailing  stim¬ 
ulus  conditions.  It  could  occur  unsupervised,  and  would  strictly  depend 
only  on  the  stimuli,  not  on  the  task. 

The second stage, updating of the coefficients $c_\alpha$, could occur only
supervised, since it depends on the full input and output example pairs, or, in
other words, on the task. It could be achieved by a simple Hebb-type rule,
since the gradient descent equations for the $c_\alpha$ are (Poggio and Girosi 1989):

\[
\dot{c}_\alpha \propto \sum_{i=1}^{N} \Delta_i \, G\left(\|x_i - t_\alpha\|_W^2\right) \qquad (8.5)
\]

with $\alpha = 1, \ldots, n$ and $\Delta_i$ the error between the correct
output for example i and the actual output of the network. Thus equation
(8.5) says that the change in the $c_\alpha$ should be proportional to the
product of the activity of the unit $\alpha$ and the output error of the network. In
other words, the "weights" of the $c_\alpha$ synapses will change depending on
the product of pre- and postsynaptic activity (Poggio and Girosi 1989; Mel
1988, 1990).
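A minimal Python sketch of this update rule follows. It assumes fixed centers (the RBF case), W equal to the identity, a Gaussian basis function, an error term defined as the correct output minus the actual output, and a hypothetical learning rate and toy task; none of these specifics come from the cited work. Each coefficient changes by the sum, over examples, of the unit's activity times the output error.

```python
import math

def G(x, t):
    """Gaussian basis function of the squared distance (W = identity assumed)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, t)))

def network(x, centers, c):
    return sum(ca * G(x, ta) for ca, ta in zip(c, centers))

def total_sq_error(examples, centers, c):
    return sum((y - network(x, centers, c)) ** 2 for x, y in examples)

def hebb_step(examples, centers, c, rate=0.1):
    """One update: each coefficient grows by rate * sum_i error_i * activity_i.

    error_i is the correct output minus the network output for example i,
    and activity_i is the response of that coefficient's center to example i.
    """
    errors = [y - network(x, centers, c) for x, y in examples]
    return [ca + rate * sum(e * G(x, ta) for e, (x, _) in zip(errors, examples))
            for ca, ta in zip(c, centers)]

# Hypothetical toy task: respond 1 to views near (0, 0) and 0 to views near (2, 2).
examples = [([0.0, 0.0], 1.0), ([0.2, -0.1], 1.0),
            ([2.0, 2.0], 0.0), ([1.9, 2.2], 0.0)]
centers = [[0.0, 0.0], [2.0, 2.0]]   # fixed centers taken from the examples
c = [0.0, 0.0]

error_before = total_sq_error(examples, centers, c)
for _ in range(200):
    c = hebb_step(examples, centers, c)
error_after = total_sq_error(examples, centers, c)
```

Only the coefficient of the center near the target views grows appreciably; the other center is barely active when the error is nonzero, so the product of activity and error leaves its coefficient near zero.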

In  the  RBF  case,  the  centers  are  fixed  after  they  are  initially  selected  to 
conform  to  the  input  examples.  In  the  HBF  case,  the  centers  move  to  opti¬ 
mal  locations  during  learning.  This  movement  may  be  seen  as  task-specific 
or  supervised  fine-tuning  of  the  centers'  stimulus  selectivities.  It  is  highly 
unlikely  that  the  biological  visual  system  chooses  between  distinct  RBF-like 
and  HBF-like  implementations  for  given  problems.  It  is  possible,  though, 
that  tuning  of  cell  selectivities  can  occur  in  at  least  two  different  ways, 
corresponding  to  the  supervised  and  unsupervised  stages  outlined  here.  We 
might  also  expect  that  these  two  types  of  learning  of  "centers"  could  occur 
on  two  different  time  scales:  one  fast,  corresponding  to  selecting  centers 
from  a  preexisting  set,  and  one  slow,  corresponding  to  synthesizing  new 
centers  or  refining  their  stimulus  specificities.  The  cortical  locations  of 
these two mechanisms, one unsupervised, the other supervised, may be
different and have interesting implications for how to interpret data on
transfer  of  learning  (see  Poggio  et  al.  1992). 

For  fast,  unsupervised  learning,  there  might  be  a  large  reservoir  of  cen¬ 
ters  already  available,  most  of  them  with  an  associated  c  =  0,  as  suggested 
by  Mel  (1990)  in  a  slightly  different  context.  The  relevant  ones  would  gain  a 
nonzero  weight  during  the  adaptive  process.  Alternatively,  the  mechanism 
could  be  similar  to  some  of  the  unsupervised  learning  models  described 
by  Linsker  (1990),  Intrator  and  Cooper  (1991),  Foldiak  (1991),  and  others. 

Slow,  supervised  learning  may  occur  by  the  stabilization  of  electrically 
close  synapses  depending  on  the  degree  to  which  they  are  co-activated 
(see,  e.g.,  Mel  1992).  In  this  scheme,  the  changes  will  be  formation  and 
stabilization  of  synapses  and  synapse  clusters  (each  synapse  representing 
a  Gaussian  field)  on  a  cortical  pyramidal  cell  simply  due  to  correlations 
of  presynaptic  activities.  We  suggest  that  this  synthesis  of  new  centers, 
as  would  be  needed  in  learning  to  recognize  unfamiliar  objects,  should 
be  slower  than  selecting  centers  from  an  existing  pool.  But  some  recent 
data  on  perceptual  learning  (e.g.,  Fiorentini  and  Berardi  1981;  Poggio  et 
al. 1992; Karni and Sagi 1990) indicate otherwise: the fact that human
observers  rapidly  learn  entirely  novel  visual  patterns  suggests  that  new 
centers  might  be  synthesized  rapidly. 

It  seems  reasonable  to  conjecture,  though,  that  updating  of  the  elements 
of  the  W  matrix  may  take  place  on  a  much  slower  time  scale. 

Do  the  update  schemes  have  a  physiologically  plausible  implementa¬ 
tion?  Methods  like  the  random-step  method  (Caprile  and  Girosi  1990), 
that  do  not  require  calculation  of  derivatives,  are  biologically  the  most 
plausible.  (In  a  typical  random-step  method,  network  weight  changes  are 
generated  randomly  under  the  guidance  of  simple  rules;  for  example,  the 
rule might be to double the size of the random change if the network
performance improves and to halve the size if it does not.) In the Gaussian
case,  with  basis  functions  synthesized  through  the  product  of  Gaussian 
receptive  fields,  moving  the  centers  means  establishing  or  erasing  connec¬ 
tions  to  the  product  unit.  A  similar  argument  can  be  made  also  about  the 
learning  of  the  matrix  W.  Notice  that  in  the  diagonal  Gaussian  case  the 
parameters to be changed are exactly the σ of the Gaussians (i.e., the spread
of the associated receptive fields). Notice also that the σ for all centers on
one particular dimension is the same, suggesting that the learning of w_i
may  involve  the  modification  of  the  scale  factor  in  the  input  arrays  rather 
than  a  change  in  the  dendritic  spread  of  the  postsynaptic  neurons.  In  all 
these  schemes  the  real  problem  consists  in  how  to  provide  the  "teacher" 
input. 
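
The random-step rule described in the parenthesis above can be sketched in a few lines of Python. This is only an illustration of the doubling and halving rule as stated in the text, not the specific procedure of Caprile and Girosi (1990); the uniform distribution of the random change, the lower bound on the step size, the quadratic loss, and all names are assumptions.

```python
import random

def random_step_minimize(loss, weights, step=0.5, iterations=2000, seed=0):
    """Derivative-free search: propose a random change, keep it only if it helps."""
    rng = random.Random(seed)
    weights = list(weights)
    best = loss(weights)
    for _ in range(iterations):
        trial = [w + rng.uniform(-step, step) for w in weights]
        trial_loss = loss(trial)
        if trial_loss < best:
            # Performance improved: accept and double the size of the random change.
            weights, best = trial, trial_loss
            step *= 2.0
        else:
            # No improvement: reject and halve the size (floor is an assumption).
            step = max(step * 0.5, 1e-12)
    return weights, best

def quadratic_loss(w):
    """Hypothetical performance measure with its minimum at (1, -2)."""
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2

initial = [0.0, 0.0]
final_weights, final_loss = random_step_minimize(quadratic_loss, initial)
```

No derivative is ever computed: the only signal used is whether performance went up or down, which is the property that makes such schemes attractive from a biological standpoint.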

PREDICTIONS  AND  REMARKS 

To summarize, we highlight the main predictions made by our interpretation
of memory-based models of the brain.

1. Sparse population coding. The general issue of how the nervous system
represents objects and concepts is of course unresolved. "Sparse" or
"punctate"  coding  theories  propose  that  individual  cells  are  highly  specific 
and  encode  individual  patterns.  "Population"  theories  propose  that  dis¬ 
tributed  activity  in  a  large  number  of  cells  underlies  perception.  Models 
of  the  HBF  type  suggest  that  a  small  number  of  cells  or  groups  of  cells  (the 
centers),  each  broadly  tuned,  may  be  sufficient  to  represent  a  3D  object. 
Thus  our  interpretation  of  MBMs  predicts  a  "sparse  population  coding," 
partway between fully distributed representations and grandmother neurons.
Specifically, we predict that the activity of approximately 100 cells
is  sufficient  to  distinguish  any  particular  object,  although  many  more  cells 
may  be  active  at  the  same  time. 

2.  Viewer-centered  and  object-centered  cells.  Our  model  (see  the  module  of 
figure  8.3)  predicts  the  existence  of  viewer-centered  cells  (the  centers)  and 
object-centered  cells  (the  output  of  the  network).  Evidence  pointing  in  this 
direction  in  the  case  of  face  cells  in  IT  is  already  available.  We  predict  a 
similar  situation  for  other  3D  objects.  It  should  be  noted  that  the  module 
of  figure  8.3  is  only  a  small  part  of  an  overall  architecture.  We  predict 
the  existence  of  other  types  of  cells,  such  as  pose-tuned,  expression-tuned, 
and  illumination-tuned  cells.  Very  recently  N.  Logothetis  (personal  com¬ 
munication)  has  succeeded  in  training  monkeys  to  recognize  the  same 
objects  used  in  human  psychophysics,  and  has  reproduced  the  key  results 
of Bülthoff and Edelman (1992). He also succeeded in measuring generalization
fields of the type shown in figure 8.4 after training on a single
view.  We  believe  that  such  a  psychophysically  measured  generalization 
field  corresponds  to  a  group  of  cells  tuned  in  a  Gaussian-like  manner  to 
that  view.  We  expect  that  in  trained  monkeys,  cells  exist  corresponding 
to  the  hidden  units  of  an  HBF  network,  specific  for  the  training  view,  as 
well  as  possibly  other  cells  responding  to  subparts  of  the  view.  We  conjec¬ 
ture  (although  this  is  not  a  critical  prediction  of  the  theory)  that  the  step  of 
creating the tuned cells (i.e., the centers) is unsupervised: in other words,
that  to  create  the  centers  it  would  be  sufficient  to  expose  monkeys  to  target 
views  without  actually  training  them  to  respond  in  specific  ways. 

3.  Cells  tuned  to  full  views  and  cells  tuned  to  parts.  Our  model  implies  that 
both  high-dimensional  and  low-dimensional  centers  should  exist  for  recog¬ 
nizable  objects,  corresponding  to  full  templates  and  template  parts.  Physi¬ 
ologically  this  corresponds  to  cells  that  require  the  whole  object  to  respond 
(say, a face) as well as cells that also respond when only a part of the object
is  present  (say,  the  mouth). 

4.  Rapid  synaptic  plasticity.  We  predict  that  the  formation  of  new  centers 
and  the  change  in  synaptic  weights  may  happen  over  short  time  scales 
(possibly  minutes)  and  relatively  early  in  the  visual  pathway  (see  Poggio 
et al. 1992). As we mentioned, it is likely that the formation of new centers
is unsupervised while other synaptic changes, corresponding to the c_α
coefficients, should be supervised.

HBF-LIKE MODULES AND THEORIES OF THE BRAIN

As theories of the brain (or of parts of it), HBF networks replace computation
with memory. They are equivalent to modules that work as interpolating
look-up tables. In a previous paper one of us has discussed how theories
of  this  type  can  be  regarded  as  a  modern  version  of  the  "grandmother  cell" 
idea  (Poggio  1990). 

The  proposal  that  much  information  processing  in  the  brain  is  performed 
through  modules  that  are  similar  to  enhanced  look-up  tables  is  attractive 
for  many  reasons.  It  also  promises  to  bring  closer  apparently  orthogonal 
views,  such  as  the  immediate  perception  of  Gibson  (1979)  and  the  representa¬ 
tional  theory  of  Marr  (1982),  since  almost  iconic  "snapshots"  of  the  world 
may  allow  the  synthesis  of  computational  mechanisms  equivalent  to  vision 
algorithms.  The  idea  may  change  significantly  the  computational  perspec¬ 
tive  on  several  vision  tasks.  As  a  simple  example,  consider  the  different 
specific  tasks  of  hyperacuity  employed  by  psychophysicists.  The  proposal 
would  suggest  that  an  appropriate  module  for  the  task,  somewhat  similar 
to  a  new  "routine,"  may  be  synthesized  by  learning  in  the  brain  (see  Poggio 
et  al.  1992). 

The claim common to several network theories, such as multilayer perceptrons
and HBF networks, is that the brain can be explained, at least in
part,  in  terms  of  approximation  modules.  In  the  case  of  HBF  networks 
these  modules  can  be  considered  as  a  powerful  extension  of  look-up  ta¬ 
bles.  MLP  networks  cannot  be  interpreted  directly  as  modified  look-up 
tables  (they  are  more  similar  to  an  extension  of  multidimensional  Fourier 
series),  but  the  case  of  normalized  inputs  shows  that  they  are  similar  to 
using  templates. 

The  HBF  theory  predicts  that  population  coding  (broadly  tuned  neurons 
combined  linearly)  is  a  consequence  of  extending  a  look-up  table  scheme — 
corresponding  to  interval  coding — to  yield  interpolation  (or  more  precisely 
approximation),  that  is,  generalization.  In  other  words,  sparse  population 
coding  and  neurons  tuned  to  specific  optimal  stimuli  are  direct  and  strong 
predictions  of  HBF  schemes.  It  seems  that  the  hidden  units  of  HBF  models 
bear  suggestive  similarities  with  the  usual  descriptions  of  cortical  neurons 
as  being  tuned  to  optimal  multidimensional  stimuli.  It  is  of  course  possible 
that a hierarchy of different networks (for example, MLP networks) may
lead  to  tuned  cells  similar  to  the  hidden  units  of  HBF  networks. 

APPENDIX  A:  AN  ARCHITECTURE  FOR  RECOGNITION 

The  Classification  and  Indexing  Route  to  Recognition 

Figure 8.8 Different M-arrays corresponding to different types of measurements (from left
to right: I, If (J), j A/^ and ∂xxI + ∂yyI). The measurements to be used are obtained on a
much coarser grid than the original image. (From Brunelli and Poggio 1992)

Here we elaborate on the architecture for a recognition system introduced
in section 2. Figure 8.2 illustrates the main components of the architecture
and its two interlocking routes to recognition. The first route, which
we call the classification and indexing route, is essentially equivalent to
an earlier proposal (Poggio and Edelman 1990) in which an HBF network
receives  inputs  in  the  form  of  feature  parameters  and  classifies  inputs  as 
same  or  different  from  the  target  object.  This  is  a  streamlined  route  to  recog¬ 
nition  that  requires  that  the  features  extracted  in  the  early  stages  of  image 
analysis  be  sufficient  to  enable  matching  with  samples  in  the  database. 
Its  goal  may  be  primarily  basic  level  recognition,  but  it  is  also  the  route 
that  might  suit  best  the  search  for  and  recognition  of  an  expected  object. 
In  that  case  it  may  be  used  to  identify  objects  (at  the  subordinate  level) 
whose  class  membership  is  known  in  advance.  It  consists  of  three  main 
stages: 

1. Image measurements. The first step is to compute a primal image representation,
which is a set of sparse measurements on the image, based on
appropriate  smoothed  derivatives,  corresponding  to  center-surround  and 
directional  receptive  fields.  It  can  be  argued  that  the  (vector)  measure¬ 
ments  to  be  considered  should  be  multiple  nonlinear  functions  of  differ¬ 
ential  operators  applied  to  the  image  at  sparse  locations  (for  a  discussion 
of  linear  and  nonlinear  measurement  or  "matching"  primitives  see  Ap¬ 
pendix  in  Nishihara  and  Poggio  1984).  (Similar  procedures  may  involve 
using  Gaussians  of  different  scales  and  orientations  [e.g.,  Marr  and  Poggio 
1977], Koenderink's "jets" [Koenderink and van Doorn 1990], Gabor filters,
or  wavelets.  A  regularized  gradient  of  the  image  also  works  well.)  We  call 
this  array  of  measurements  an  M-array;  in  general,  it  is  a  vector-valued 
array  (figure  8.8).  For  recognition  of  frontal  images  of  faces  an  M-array  as 
small  as  30  x  30  has  been  found  sufficient  to  encode  an  image  of  initial  size 
512  x  512  (Brunelli  and  Poggio  1992). 

2. Feature detection and measurements. Key features, encoded by the primal
measurements, are then found and localized. These features may be
specific to a particular object class: the expected class, if it is known in
advance, or an alternative class considered as a potential match. This
step can be regarded as performing a sort of template matching with several
appropriate examples; when a face is the object of the search, templates
may  include  eye  pairs  of  different  size,  pose,  and  expression.  In  the  HBF 
case  the  templates  would  effectively  correspond  to  different  centers,  and 
matching  would  proceed  in  a  more  sophisticated  way  than  direct  compar¬ 
ison.  It  is  clear  that  this  step  may  by  itself  accomplish  segmentation. 

3.  Classification  and  indexing.  Parameter  values  estimated  by  the  preceding 
stage  for  the  features  of  interest  (e.g.,  the  distance  between  eyes  and  mouth) 
are  used  in  this  stage  for  classification  and  indexing  in  a  database  of  known 
examples.  In  many  cases  this  may  lead  by  itself  to  unique  recognition, 
especially  when  situational  information,  such  as  the  expectedness  of  a 
particular object, restricts the relevant database. Classification could be
done  via  a  number  of  classical  schemes  such  as  nearest  neighbor  or  with 
modules  that  are  more  biologically  plausible  such  as  HBF  networks. 
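
The third stage can be illustrated with the simplest of the classical schemes mentioned, nearest neighbor. The sketch below is hypothetical throughout: the feature parameters, the database entries, the field names, and the function name are invented for illustration. It shows only how restricting the database by an expected class can change the outcome of indexing.

```python
import math

def nearest_neighbor(query, database, expected_class=None):
    """Return (label, class) of the stored example closest to the query features."""
    candidates = [
        entry for entry in database
        if expected_class is None or entry["cls"] == expected_class
    ]
    # Euclidean distance in feature-parameter space.
    best = min(candidates, key=lambda e: math.dist(query, e["features"]))
    return best["label"], best["cls"]

# Hypothetical database of known examples, each described by two feature parameters.
database = [
    {"label": "face_A", "cls": "face", "features": (6.2, 7.0)},
    {"label": "face_B", "cls": "face", "features": (5.4, 7.8)},
    {"label": "car_A", "cls": "car", "features": (6.1, 7.1)},
]

query = (6.0, 7.2)
unrestricted = nearest_neighbor(query, database)
restricted = nearest_neighbor(query, database, expected_class="face")
```

In this toy case the unrestricted search returns the car entry, while the search restricted to the expected class returns a face, which is the sense in which situational information can make recognition unique.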

Some  open  questions  remain: 

•  What  are  the  features  used  by  the  human  visual  system  in  the  feature 
detection  stage?  A  plausible  idea  is  that  there  is  a  large  set  of  filters  tuned 
to  different  2D  shape  features  and  efficiently  doing  a  kind  of  template 
matching  on  the  input.  Some  functional  of  the  correlation  function  is  then 
evaluated  (such  as  the  max  of  the  correlation  or  some  robust  statistics  on 
the  correlation  values).  The  results  represent  some  of  the  components  (for 
that  particular  filter,  i.e.,  template)  of  the  input  vector  to  object-specific 
networks  consisting  of  hidden  units  each  tuned  to  a  view  and  an  output 
unit which is view-invariant. Networks of this type may exist not only
for  specific  objects  but  also  for  general  object  components,  perhaps  similar 
to  more  precise  versions  of  some  of  Biederman's  geons  (Biederman  1987). 
They  would  be  synthesized  by  familiarity  and  their  output  may  have  a 
varying  degree  of  view  invariance  depending  on  the  type  and  number 
of  the  tuned  cells  in  the  hidden  layer.  Networks  of  this  type,  tuned  to 
a  particular  shape,  could  easily  be  combined  conjunctively  to  represent 
more  complex  shapes  (but  still  exploiting  the  fundamental  property  of 
additivity).  This  general  scheme  avoids  the  correspondence  problem  since 
the  components  of  the  input  vectors  are  statistics  taken  over  the  whole 
image,  rather  than  individual  pixel  values  or  feature  locations.  It  may 
well  be  that  in  the  absence  of  a  serial  mechanism  such  as  eye  motions  and 
attentional  shifts  the  visual  system  does  not  have  a  way  to  keep  and  use 
spatial relations between different components or features in an image and
that  it  can  only  detect  the  likely  "presence"  of,  say,  a  few  hundred  features 
of  various  complexity. 

•  The  architecture  described  here  consists  of  a  hierarchy  of  HBF-like  net¬ 
works.  Does  the  human  visual  system  operate  with  a  similar  hierarchy? 
For  instance,  an  eye-recognizing  MBM  network  may  provide  some  of  the 
inputs  to  a  face  recognition  network  that  will  combine  the  presence  (and 
possibly relative position) of eyes with other face features (remember that
an MBM network can be regarded as a disjunction of conjunctions). The
inputs  to  the  eye-recognizing  networks  may  be  themselves  provided  by 
other  RBF-like  networks;  this  is  similar  to  the  use  in  the  eye-recognizing 
networks  of  inputs  that  are  the  result  of  filtering  the  image  through  a  few 
basic  filters  out  of  a  large  vocabulary  consisting  of  hundreds  of  "elemen¬ 
tary"  templates,  representing  a  vocabulary  of  shapes  of  the  type  described 
by  Fujita  and  Tanaka  (1992).  The  description  of  Perrett  and  Oram  (1992)  is 
consistent  with  this  scenario.  At  various  stages  in  this  hierarchy  more  in¬ 
variances  may  be  achieved  for  position,  rotation,  scaling,  etc.,  in  a  manner 
similar  to  how  complex  cells  are  built  from  simple  ones. 

The  Visualization  Route  to  Recognition 

The  second  potential  route  to  recognition  takes  a  necessary  detour  from  the 
first  route  to  fine-tune  the  matching  mechanisms.  Like  the  classification 
pathway  it  begins  with  the  two  stages  of  image  measurement  and  feature 
detection,  but  diverges  because  it  allows  for  the  possibility  that  a  match 
between  the  database  and  measured  image  features  might  not  directly  be 
found.  Further  processing  may  take  place  on  the  image  or  on  the  stored 
examples  to  bring  the  two  into  registration  or  to  narrow  the  range  of  the 
latter.  The  main  purpose  of  this  loop  is  to  correct  for  deformations  before 
comparing image to database.

Computational  arguments  (Breuel  1992)  suggest  that  this  route  should 
separate  transformations  to  be  applied  to  the  image  (to  redress  image- 
plane  deformations  such  as  image-plane  translations,  scaling,  and  rota¬ 
tions)  from  those  to  be  applied  to  the  database  model  (which  may  include 
rotations-in-depth,  illumination  changes,  and  alterations  in  facial  expres¬ 
sion,  for  example).  The  system  may  try  a  number  of  transformations  in 
parallel  and  on  multiple  scales  of  spatial  resolution  (see  chapter  13  by  Van 
Essen,  Anderson,  and  Olshausen)  until  it  finds  the  one  that  succeeds.  In 
general the whole process may be iterated several times before it achieves a
satisfactory  level  of  confidence  (see  chapter  7  by  Mumford  and  chapter  12 
by  Ullman  for  similar  proposals).  In  the  primate  visual  system,  the  likely 
site  for  the  latter  transformations  is  cortical  area  IT,  whereas  the  former 
would  probably  take  place  earlier,  as  available  results  on  properties  of  IT 
seem to suggest (Gross 1992; Perrett et al. 1982; Perrett and Harries 1988;
Perrett  et  al.  1989).  The  main  steps  of  this  hypothetical  second  route  to 
recognition  are  as  follows: 

1.  Image  measurement. 

2.  Feature  detection. 

3.  Image  rectification.  The  feature  detection  stage  provides  information 
about  the  location  of  key  features  that  is  used  in  this  stage  to  normalize 
for  image-plane  translation,  scaling  and  image-plane  rotation  of  the  input 
M-array.

4. Pose estimation. 3D pose (two parameters), illumination, and other parameters
(such as facial expression) are estimated from the M-array. This
computation  could  be  performed  by  an  MBM  module  that  has  "learned" 
the  appropriate  estimation  function  from  examples  of  objects  of  the  same 
class. 

5. Visualization. The models (M-arrays in the database corresponding to
known  objects)  are  warped  in  the  dimensions  of  pose  and  expression  and 
illumination,  to  bring  them  into  register  with  the  estimate  obtained  from 
the  input  image.  The  transformation  of  the  models  is  performed  by  ex¬ 
ploiting  information  specific  to  the  given  object  (several  views  per  object 
may  have  been  stored  in  memory)  or  by  applying  a  generic  transformation 
(e.g.,  for  a  face,  from  "serious"  to  "smiling")  learned  from  objects  of  the 
same  class.  Several  transformations  may  be  attempted  at  this  stage  before 
a  good  match  is  found  in  the  next  step. 

6.  Verification  and  indexing.  The  rectified  "image"  is  compared  with  the 
warped  database  of  standard  representations.  Open  questions  remain  on 
how  the  database  may  be  organized  and  what  are  the  most  efficient  means 
of  indexing  it. 

APPENDIX  B:  ON  THE  DECOMPOSITION  OF  MULTIDIMENSIONAL 
INPUTS 

It  is  well  known  (see  earlier,  and  Poggio  and  Girosi  1989)  that  the  simplest 
version  of  a  regularization  network  approximates  a  vector  field  y(x)  as 

    y(x) = \sum_{i=1}^{N} c_i \, G(x - x_i)    (8.6)

with G being the chosen Green function and

    (G) c^m = y^m.    (8.7)

It follows that the vector field is approximated as the linear combination
of example fields, that is

    y(x) = \sum_{l=1}^{N} b_l(x) \, y_l    (8.8)

where the b_l depend on the chosen G.
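
The step from (8.6) and (8.7) to (8.8) can be checked numerically. The sketch below assumes a Gaussian Green function, scalar inputs, two-component outputs, and toy example pairs; none of these come from the text, and the function names are hypothetical. It computes the network output once through the coefficients c and once as a combination of the example outputs with weights b(x) obtained by solving the same linear system with the basis activities on the right-hand side.

```python
import math

def green(r):
    """Assumed Green function: a Gaussian of the difference r."""
    return math.exp(-r * r)

def solve(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(row) + [v] for row, v in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z

# Hypothetical examples: scalar inputs x_i with two-component outputs y_i.
xs = [0.0, 1.0, 2.0]
ys = [(1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
G = [[green(xi - xj) for xj in xs] for xi in xs]

# One linear system per output component m, as in (8.7).
coeffs = [solve(G, [y[m] for y in ys]) for m in range(2)]

def network_output(x):
    """Output as a sum of basis functions weighted by the coefficients, as in (8.6)."""
    g = [green(x - xi) for xi in xs]
    return [sum(c * gi for c, gi in zip(coeffs[m], g)) for m in range(2)]

def example_weights(x):
    """Weights b_l(x); G is symmetric, so b solves G b = g(x)."""
    return solve(G, [green(x - xi) for xi in xs])

def combination_of_examples(x):
    """Output as a linear combination of the example outputs, as in (8.8)."""
    b = example_weights(x)
    return [sum(bl * y[m] for bl, y in zip(b, ys)) for m in range(2)]
```

Evaluating `network_output` and `combination_of_examples` at the same x gives the same vector up to rounding, and at an example location the weights reduce to selecting that example alone.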

Thus for any choice of the regularization network (even HBF) and any choice
of the Green function (including Green functions corresponding to additive splines
and tensor product splines) the estimated output (vector) image is always a linear
combination of example (vector) images with coefficients that depend (nonlinearly)
on  the  desired  input  value.  This  observation  together  with  the  fundamental 
hypothesis  suggests  that  an  output  vector  (say  a  vectorized  image)  can  be 
represented as a linear combination of examples. This is similar to decomposition
in parts.

Constructing Neuronal Theories of Mind

Michael  I.  Posner  and  Mary  K.  Rothbart 


What  should  a  theory  of  higher  brain  function  be  about?  If  it  is  to  be  a 
theory  of  the  brain,  it  is  clear  that  it  should  be  about  neurons  and  neu¬ 
roanatomy.  Yet  it  is  widely  agreed  that  even  a  complete  description  of 
the  functions  of  neurons  is  unlikely  to  be  adequate  to  provide  a  theory  of 
higher  brain  function.  Nor  do  studies  of  the  brain's  structure  provided  by 
the  new  neuroimaging  methods,  such  as  positron  emission  tomography  or 
magnetic  resonance  imagery,  by  themselves  provide  a  clear  answer  to  the 
question  of  brain  functions.  In  our  view,  the  term  "higher"  indicates  that 
such  a  brain  theory  should  also  be  concerned  with  mind.  If  so,  it  becomes 
important  to  know  what  "mind"  is  like,  how  to  measure  it,  and  what  a 
theory  of  mind  must  explain. 

Many  descriptions  of  mind  begin  with  subjective  experience  and  mea¬ 
sure  "mind"  by  self  reports  given  verbally  or  nonverbally  about  that  expe¬ 
rience.  We  believe  the  task  of  connecting  brain  to  mind  requires  as  fine  an 
analysis  of  mind  as  we  have  been  able  to  make  of  neuronal  activity.  In  our 
view,  the  analysis  of  mind  necessary  to  make  connections  with  brain  sys¬ 
tems  involves  its  specification  into  the  elementary  operations  that  provide 
a  basis  for  localization  of  function  within  neural  systems.  Over  the  past 
25  years  researchers  in  cognitive  science  have  developed  ways  of  defining 
and  measuring  mental  operations. 

FRAMEWORK 

Cognitive  Systems 

We  find  it  useful  to  view  the  connection  between  cognitive  systems  and 
neurosystems  in  terms  of  a  very  general  framework  shown  in  figure  9.1. 
This framework involves five levels of analysis. At the highest level is performance
of the tasks of everyday life. These include activities like reading,
recognizing  faces,  daydreaming,  moving  from  place  to  place,  playing  mu¬ 
sic,  writing,  and  planning  a  trip.  Verbal  report  is  often  informative  at  this 
level;  we  can  report  what  faces  look  like  or  our  intention  to  go  someplace. 

Figure 9.1 Framework for linking cognitive and neural systems.

A great deal of evidence from studies of lesioned patients indicates that
these tasks may be grouped together into a somewhat smaller number of
"cognitive systems," with many tasks of daily life drawing on the same cognitive
system. Thus reading, writing, speaking, and conversing are all tasks
of daily life and they all involve the cognitive system we call "language."
Brain damage (lesions) at various locations of the left hemisphere impairs
aspects of this language system. The idea of a cognitive system has some
analogies  with  an  organ  system  in  that  it  is  a  set  of  structures  functioning 
together  to  allow  the  performance  of  a  general  function.  Sometimes  "plan¬ 
ning  ahead"  is  thought  to  involve  a  common  cognitive  system.  Lesions 
of  the  frontal  lobes  often  impair  planning  and  thus  there  may  be  a  brain 
system  that  underlies  it.  Similarly,  selective  attention  appears  to  involve 
particular  brain  systems. 

Elementary  Operations 

Complex  cognitive  tasks  such  as  playing  chess,  reading,  or  manipulating 
visual  images  have  also  been  subjected  to  detailed  analysis.  These  analy¬ 
ses  have  divided  a  task  into  logical  operations  that  might  form  the  basis 
for  programming  a  computer  to  simulate  human  performance.  Consider 
the  task  of  imagining  you  are  walking  along  a  familiar  route.  One  anal¬ 
ysis  of  the  imagery  process  identifies  12  elementary  operations  (Kosslyn 
1980).  Each  operation  has  an  input,  a  computation,  and  an  output.  When 
organized  into  an  appropriate  sequence,  one  has  a  computational  model 
capable  of  performing  an  imagery  task. 

This  approach  to  mental  imagery  is  a  form  of  artificial  intelligence  (AI) 
called symbol processing. Its emphasis is on the logic of a set of operations
that would be sufficient to program or simulate the task being studied.
Although  AI  models  do  not  directly  model  the  brain,  they  sometimes  help 
in  the  construction  of  neural  models  by  making  clear  what  the  logical 
operations  are.  The  logical  analysis  of  the  model  may  or  may  not  relate  to 
what  happens  when  people  actually  perform  the  task.  The  model  provides 
a  kind  of  sufficiency  analysis,  in  that  it  shows  the  task  can  be  analyzed  into 
a  set  of  subroutines  sufficient  to  perform  it. 

Psychological  Pathways 

The  next  step  in  our  effort  to  link  cognitive  processes  to  neural  systems  is  to 
ask  how  a  human  mind  performs  the  postulated  operation.  To  achieve  that 
goal,  we  need  to  design  a  model  task  incorporating  the  operations  under 
study.  How,  for  example,  does  one  generate  a  visual  image?  Suppose  you 
are  presented  with  either  a  visual  or  an  auditory  letter.  Your  task  is  to 
determine  as  quickly  as  possible  if  a  second  (probe)  letter  is  the  same  letter 
as  the  first  (Posner  1978).  If  the  first  letter  is  a  visual  upper  case  A  and  a 
probe  letter  in  the  same  case  is  presented  immediately  (e.g.,  AA),  it  takes 
about  80  msec  less  to  process  than  if  the  probe  is  in  the  opposite  case  (e.g., 
Aa).  Following  a  spoken  letter,  both  upper-  and  lowercase  visual  probes 
also  take  about  80  msec  longer  than  would  a  direct  visual  physical  match. 
After  a  half  second  delay  between  the  first  letter  and  the  probe,  uppercase 
probes  are  handled  just  as  fast,  whether  the  first  letter  is  visual  or  auditory, 
but  lowercase  probes  still  take  longer.  It  takes  about  half  a  second  to 
generate  an  optimal  visual  representation  of  the  auditory  stimulus.  By 
this  objective  test,  you  have  now  generated  an  image. 

How  in  detail  is  an  image  generated?  If  you  are  presented  with  the  letter 
name  F  (Kosslyn  1988)  and  asked  to  form  a  visual  image,  how  can  we  tell 
if  you  are  doing  so?  We  can,  of  course,  ask  you.  But  even  if  we  take  your 
answer  as  definitive,  we  would  have  no  way  of  going  further  to  ask  in 
detail  how  the  image  is  constructed,  because  you  really  have  little  insight 
into  that  level  of  your  processing  system.  To  find  out  if  an  image  has  been 
formed,  a  slightly  different  probe  can  be  used.  This  time  you  are  asked 
whether  a  probe  X  that  appears  is  located  on  or  off  the  image  you  have 
created.  Immediately  after  the  presentation  of  the  first  letter,  you  will  be 
slow  to  verify  whether  the  X  lies  on  the  image  or  not  because  the  image  is 
not  yet  created.  After  a  short  delay  you  are  fast  to  verify  probes  that  lie  on 
the  upright  of  the  letter  but  slow  for  those  on  the  cross  bar.  It  is  as  though 
you  have  generated  the  left  part  of  the  image  but  not  yet  the  rest  of  it.  In 
fact,  these  images  appear  to  be  generated  stroke  by  stroke.  What  is  most 
remarkable  is  that  as  you  generate  them,  verification  of  probes  lying  on 
the  stroke  that  has  already  been  generated  is  facilitated. 

Suppose you are presented with an upright letter, and asked to rotate it
clockwise in your mind's eye (Cooper and Shepard 1975). Are you actually
performing  the  rotation?  To  test  whether  you  are,  we  probe  with  letters 
at varying angles from the upright. You are asked to report by pressing
one key if the probe is a correct letter and another key if it is a mirror-image
letter. Prior to starting the experiment we calculate your rotation
speed  from  the  reaction  time  to  respond  to  letters  presented  at  different 
angles  of  orientation.  Suppose  your  calculated  rotation  speed  is  100° /sec. 
After  0.3  sec  you  should  be  faster  to  30°  probes  and  slower  to  upright  or  60° 
probes.  That  is  exactly  what  is  found;  you  are  actually  faster  in  responding 
to  a  probe  letter  at  30°  orientation  than  to  one  that  is  upright  at  the  usual 
angle  at  that  we  experience  letters.  Your  mental  rotation  has  objective 
consequences  that  can  be  measured  precisely  in  terms  of  the  time  to  verify 
the  probe,  and  they  are  strong  enough  to  overcome  the  usual  preference 
for  upright  letters  based  on  past  experience. 
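
The arithmetic behind the probe prediction can be written out as a short sketch. It assumes a constant rotation speed and assumes that verification is fastest for the probe whose angle is closest to the current orientation of the image. The function names are hypothetical and are not taken from Cooper and Shepard.

```python
def predicted_orientation(rotation_speed_deg_per_sec: float, elapsed_sec: float) -> float:
    """Angle (degrees clockwise from upright) the mental image should have reached.

    Assumes a constant rotation speed, as in the worked example in the text.
    """
    return (rotation_speed_deg_per_sec * elapsed_sec) % 360.0


def angular_mismatch(probe_deg: float, image_deg: float) -> float:
    """Smallest absolute angle between a probe and the current image orientation."""
    diff = abs(probe_deg - image_deg) % 360.0
    return min(diff, 360.0 - diff)


def rank_probes(probe_angles, rotation_speed_deg_per_sec, elapsed_sec):
    """Order probes from best to worst match with the rotating image.

    Illustrative assumption: a smaller mismatch means a faster verification.
    """
    image = predicted_orientation(rotation_speed_deg_per_sec, elapsed_sec)
    return sorted(probe_angles, key=lambda angle: angular_mismatch(angle, image))


if __name__ == "__main__":
    # 100 deg/sec for 0.3 sec puts the image near 30 degrees,
    # so the 30-degree probe is ranked ahead of the upright and 60-degree probes.
    print(rank_probes([0, 30, 60], 100.0, 0.3))
```

With the values from the text (100°/sec, 0.3 sec), the image is predicted to be at about 30°, which is why the 30° probe is favored over the upright probe.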

It  is  also  possible  to  study  inhibition  of  processing  performance.  Suppose 
you  are  shown  a  red  S  on  top  of  a  green  K  (Allport  1989).  You  are  asked  to 
name  the  red  letter  and  ignore  the  green  letter  that  is  under  it.  On  one  trial 
you  name  S  and  the  rejected  letter  is  K.  If  on  the  next  trial  the  red  letter 
is  a  K  you  will  be  slow  in  naming  it.  When  you  select  the  red  item  you 
also  inhibit  the  green  one.  This  inhibition  remains  present  for  1-2  sec  and 
retards  your  performance  on  the  subsequent  trial. 

These  experiments  have  shown  that  the  performance  of  a  cognitive  op¬ 
eration  can  be  observed  in  terms  of  exquisitely  time-locked  facilitations 
in  the  speed  of  processing  probe  items.  We  call  this  level  of  analysis  the 
"performance  domain"  because  we  are  looking  at  facilitation  or  inhibition 
in  performance  measures  such  as  reaction  time  or  threshold  detection.  The 
use  of  the  words  facilitation  and  inhibition  is  biased  to  make  one  inquire 
whether  such  patterns  are  related  to  the  activity  of  the  populations  of  neu¬ 
ral  cells  that  might  perform  the  computation.  To  answer  the  question  of 
how  facilitation  and  inhibition  measured  in  performance  relate  to  neural 
activity  requires  methods  to  link  mental  operation  to  underlying  neural 
systems  (see  below). 

There  is  a  second  objective  method  for  studying  mental  operations.  In 
addition  to  requiring  time,  they  also  tend  to  show  specific  interference 
when  they  compete  for  the  same  computation.  To  explain  this  feature, 
consider  a  task  involving  timing  (Keele  et  al.  1985).  In  this  task,  subjects 
listen  to  a  tone  occurring  every  0.5  sec  and  tap  a  key  in  synchrony  with 
it.  After  the  tone  is  turned  off  they  continue  to  tap  at  the  same  interval. 
Performance  in  this  task  is  measured  by  the  variability  of  the  key  presses. 
In  the  focal  condition,  subjects  perform  the  task  by  itself,  but  in  a  second 
condition  they  do  it  at  the  same  time  as  a  secondary  task.  Two  different 
secondary  tasks  are  used,  both  designed  so  that  they  do  not  involve  the 
same  input  or  output  mode  as  the  primary  task.  Both  secondary  tasks  are 
designed  to  be  equivalent  in  difficulty  and  to  demand  the  same  amount  of 
attention.  One  task  is  to  determine  if  the  interval  between  one  pair  of  tones
is  the  same  as  or  different  from  the  interval  between  a  second  pair.  The  other
task  involves  the  same  tones,  but  now  the  person  judges  if  the  difference
in  loudness  between  one  pair  is  the  same  as  or  different  from  that  of  the  second
pair.  The  interval  judging  secondary  task  involves  an  internal  operation
of  timing  to  determine  the  tone  time  differences.  The  loudness  secondary
task  involves  judgments  of  intensity  that,  unlike  the  interval  task,  do  not 
overlap  with  the  timing  required  by  the  primary  tapping  task.  The  tones 
occur  while  the  subject  is  steadily  pressing  the  key  at  an  interval  that  must 
be  timed  internally.  The  finding  is  that  the  interval  judging  secondary  task 
interferes  much  more  with  the  primary  task  than  does  the  loudness  judging 
task.  When  the  internal  mental  operation  of  timing  is  shared  between  the 
two  tasks,  performance  is  affected,  thus  revealing  an  underlying  hidden 
operation  thought  to  involve  a  clock  that  is  common  to  both  perception 
and  action.  Dual-task  methods  have  been  an  important  tool  in
the  objective  measurement  of  mental  operations  (Posner  1978).

TOOLS  FOR  THEORY  CONSTRUCTION 

Methods  play  a  particularly  important  role  in  every  scientific  endeavor  and 
this  is  certainly  true  of  the  effort  to  relate  the  facilitations  and  inhibitions  in 
the  performance  of  mental  operations  to  their  underlying  neural  systems. 
To  move  from  elementary  operations  and  their  effects  upon  performance  to 
the  level  of  neural  systems  (see  figure  9.1)  it  is  important  to  have  methods 
of  localization.  Although  theories  of  localization  of  mental  function  have 
been  present  for  decades,  methods  for  examining  the  relation  between 
cognitive  function  and  brain  activity  have  been  indirect.  Much  of  the  classic 
work  has  involved  examinations  of  brain  sections  following  death,  with 
damaged  areas  of  the  brain  related  to  the  prior  behavior  of  the  organism.  In 
vivo  examinations  of  the  human  brain  by  measurement  of  electrical  activity 
have  been  possible  for  50  years,  but  the  use  of  imaging  techniques  based  on 
X  ray,  radionuclides,  and  magnetic  resonance  is  more  recent.  We  are  only 
now  developing  appropriate  strategies  to  combine  these  various  methods. 

The  heart  of  the  problem  of  constructing  neural  models  of  cognition  is  to 
move  from  the  level  of  performance  to  underlying  neural  systems  or  microcircuits.
Neuroscience  approaches  have  placed  somewhat  greater  emphasis
on  spatial  methods  that  give  hope  for  studying  localization.  Cognitive
approaches  have  tended  to  place  emphasis  on  the  temporal  organization 
of  information  flow  in  the  nervous  system. 

Cognitive  neuroscience  requires  the  integration  of  methods  that  trace 
the  time  dynamics  of  information  processing  with  those  that  provide  in¬ 
formation  on  the  location  of  neural  systems  activated.  Fortunately  new 
methods  and  adaptations  of  older  methods  have  become  available  in  the 
last  dozen  years.  Two  methods  prominent  in  this  chapter  emphasize  the 
advantage  of  combining  spatial  and  temporal  precision. 

Spatial  Localization 

Positron  emission  tomography  (PET)  is  a  radioactive  tracer  method  of  mea¬ 
suring  cerebral  blood  flow  or  metabolic  activity  that  provides  a  means  of 
tracing  cerebral  activity  during  sustained  cognitive  tasks.  As  used  in  the 
studies  discussed  here,  oxygen  15-labeled  water  is  injected  (Raichle  1987)
and  the  water  is  carried  along  with  the  blood  to  various  parts  of  the  brain. 
The  distribution  of  labeled  substance  is  monitored  by  radiation  generated 
when  positrons  are  absorbed.  This  radiation  is  sensed  by  an  array  of  de¬ 
tectors.  While  the  spatial  resolution  of  this  method  is  limited,  the  method 
becomes  quite  accurate  when  successive  scans  are  compared.  It  is  possi¬ 
ble  to  compare  scans  because  the  data  acquired  for  blood  flow  images  can 
be  obtained  within  40  sec.  Differences  between  the  central  tendencies  of 
blood  flow  changes  in  the  two  conditions  can  be  measured  to  within  a  few 
millimeters.  Although  current  PET  methods  are  not  chronometric  in  the 
sense  of  being  sensitive  to  changes  in  milliseconds,  their  probable  physical 
limit  involves  the  rate  at  which  blood  vessels  reflect  neural  changes.  At 
present,  it  is  possible  to  obtain  localization  of  blood  flow  activity  in  the 
range  of  a  few  millimeters,  and  to  look  at  changes  during  tasks  that  last 
for  less  than  a  minute. 

Temporal  Dynamics 

So  far,  the  spatial  imaging  methods  used  with  humans  (e.g.,  PET)  have 
not  provided  the  kind  of  temporal  precision  of  information  required  for 
the  analysis  of  many  cognitive  tasks,  where  differences  of  20-100  msec  are 
frequently  of  theoretical  importance  (Posner  1978).  Event-related  electrical
activity  recorded  from  the  scalp  of  humans  provides  one  method  for
achieving  high  temporal  resolution,  though  with  limited  spatial  resolution  (Mangun
et  al.  1992).  The  use  of  event-related  potentials  (ERP)  has  been  quite
helpful  in  linking  mental  operations  studied  by  chronometric  methods  to 
brain  systems  in  general.  By  combining  PET  and  ERP  studies  we  have 
been  working  on  using  the  former  to  compensate  for  the  more  limited  lo¬ 
calization  possible  with  measurements  on  the  scalp  (Compton  et  al.  1991). 

BUILDING  A  NEURAL  THEORY  OF  MIND 

In  1992  (Posner  and  Rothbart  1992)  we  summarized  evidence  arising  from 
PET  studies  suggesting  that  the  anterior  cingulate  gyrus  plays  a  critical 
role  in  one  aspect  of  attention.  The  anterior  cingulate  lies  on  the  midline  of 
the  frontal  lobe  and  has  strong  connections  with  a  variety  of  other  neural 
areas.  The  aspect  of  attention  related  to  the  anterior  cingulate  is  close  to 
what  is  often  meant  by  consciousness  and  relates  to  both  awareness  and 
to  voluntary  control  (figure  9.2). 

Below  we  summarize  the  major  pieces  of  evidence  that  formed  the  core
of  the  paper  by  Posner  and  Rothbart  (1992).

Figure  9.2  The  cortical  projections  of  the  attention  networks.  The  data  are  mainly  from  PET
studies.  The  attentional  networks  are  shown  by  solid  shapes  on  the  lateral  and  medial  surfaces
of  the  right  and  left  hemisphere.  Squares  are  the  posterior  attention  network  (parietal  lobes),
triangles,  the  vigilance  network  (right  frontal),  and  diamonds,  the  anterior  attention  network
(anterior  cingulate,  supplementary  motor  area).  The  open  shapes  refer  to  word-processing
systems  (ellipse,  visual  word  form;  circle,  semantic  associates)  that  have  been  shown  to  relate
to  the  posterior  and  anterior  attention  systems,  respectively.

The  anterior  attention  network  seems  to  be  much  more  directly  related
to  awareness  than  the  posterior  network,  as  has  been  indicated  by  the
PET  studies  cited  previously.  The  use  of  subjective  experience  as  evidence
for  a  brain  process  related  to  consciousness  has  been  criticized  by  many
authors.  However,  we  note  that  the  evidence  for  the  activation  of  the  anterior
cingulate  is  entirely  objective;  it  does  not  rest  upon  any  subjective
report.  Nevertheless,  if  one  defines  consciousness  in  terms  of  awareness, 
it  is  necessary  to  show  evidence  that  the  anterior  attention  network  is  re¬ 
lated  to  phenomenal  reports  in  a  systematic  way.  In  this  section,  we  note 
five  points,  each  of  which  appears  to  relate  subjective  experience  to  activation
of  the  anterior  attention  system.  First,  the  degree  of  activation  of
this  network  increases  with  the  number  of  targets  presented  in  a  semantic 
monitoring  task  and  decreases  with  the  amount  of  practice  in  the  task.  At 
first  one  might  suppose  that  target  detection  is  confounded  with  task  dif¬ 
ficulty.  But  in  our  semantic  monitoring  task  the  same  semantic  decision 
must  be  made  irrespective  of  the  number  of  actual  targets.  In  our  tasks 
no  storage  or  counting  of  targets  was  needed.  Thus  we  effectively  dissoci¬ 
ated  target  detection  from  task  difficulty.  Nonetheless,  anterior  cingulate 
activation  was  related  to  number  of  targets  present.  The  increase  in  activa¬ 
tion  with  number  of  targets  and  reduction  in  such  activation  with  practice 
corresponds  to  the  common  finding  in  cognitive  studies  that  conscious  at¬ 
tention  is  involved  in  target  detection  and  is  required  to  a  greater  degree 
early  in  practice  (Fitts  and  Posner  1967).  As  practice  proceeds,  feelings  of
effort  and  continuous  attention  diminish  and  details  of  performance  drop 
out  of  subjective  experience. 

Second,  the  anterior  system  appears  to  be  active  during  tasks  requiring 
the  subject  to  detect  visual  stimuli,  when  the  targets  involve  color,  form,
motion,  or  word  semantics  (Petersen  et  al.  1989;  Corbetta  et  al.  1990). 

Third,  the  anterior  attention  system  is  activated  when  listening  passively
to  words,  but  not  when  watching  those  words.  This  finding  appears  to
correspond  subjectively  to  the  intrusive  nature  of  auditory  words  to  consciousness
when  they  are  presented  in  a  quiet  background.  They  seem  to
capture  awareness.  Reading  does  not  have  this  intrusive  character.  For  a 
visual  word  to  dominate  awareness,  an  act  of  visual  orienting  is  needed  to 
boost  its  signal  strength. 

Fourth,  the  anterior  attention  system  is  more  active  during  conflict 
blocks  of  the  Stroop  test  than  during  nonconflict  blocks  (Pardo  et  al.  1990). 
This  is  consistent  with  the  commonly  held  idea  that  conflict  between  word 
name  and  ink  color  produces  a  strong  conscious  effort  to  inhibit  saying  the 
written  word  (Posner  1978).  Finally,  there  is  a  relation  between  the  vigi¬ 
lance  system  and  awareness.  When  one  attends  to  a  source  of  sensory  input 
in  order  to  detect  an  infrequent  target,  the  subjective  feeling  is  of  emptying 
the  head  of  thoughts  or  feelings.  This  subjective  "clearing  of  conscious¬ 
ness"  appears  to  be  accompanied  by  an  increase  in  activation  of  the  right 
frontal  lobe  vigilance  network  and  a  reduction  in  the  anterior  cingulate. 
Just  as  feelings  of  effort  associated  with  target  detection  or  inhibiting  pre¬ 
potent  responses  are  accompanied  by  evidence  of  cingulate  activation,  so 
the  clearing  of  thought  is  accompanied  by  evidence  of  cingulate  inhibition.
(pp.  97-99)

Both  those  who  read  the  paper  and  the  authors  were  somewhat  uneasy 
about  one  aspect  of  its  content.  It  is  not  an  easy  thing  to  be  writing  some¬ 
thing  that  might  be  interpreted  as  meaning  there  is  a  spot  inside  the  nervous 
system  that  represents  the  neural  correlate  of  consciousness.  This  is  prob¬ 
ably  even  more  true  following  Dennett's  (1992)  philosophical  critique  of 
those  who  implicitly  cling  to  a  view  that  one  area  is  the  arena  of  conscious¬ 
ness  or  what  he  calls  the  Cartesian  Theater  of  the  Mind.  Nonetheless,  the 
specific  points  made  above  seemed  to  us  to  identify  cingulate  activation 
with  aspects  of  awareness  in  so  much  tighter  a  way  than  previous  efforts 
that  it  was  reasonable  to  set  them  down  with  as  much  clarity  as  possible. 

Since  writing  that  paper,  much  has  happened  both  to  increase  our  uneasi¬ 
ness  with  identifying  consciousness  with  precise  brain  coordinates  and  to 
allow  us  to  build  out  from  that  rather  uncomfortable  position  by  specifying 
in  somewhat  more  detail  what  role  the  anterior  cingulate  might  actually 
play. 

Reentry 

One  new  development  was  our  increased  understanding  of  how  the  brain 
actually  executes  a  voluntary  instruction  to  attend  to  something  (Grossenbacher
et  al.  1991).  Studies  employing  PET  have  revealed  important  anatomical
aspects  of  word  reading  (Petersen  et  al.  1989, 1990).  Two  major  areas
of  activation  appear  within  the  visual  system.  A  right  posterior  temporal 
parietal  area  is  activated  passively  by  both  consonant  strings  and  words. 
This  activation  appears  to  be  enhanced  when  subjects  are  required  to  de¬ 
tect  a  feature.  This  area  is  thought  to  be  a  visual  representation  that  is 
prelexical.  A  left  ventral  occipital  area  is  activated  by  both  words  and 
pseudowords  (e.g.,  tweal),  but  not  by  consonant  letter  strings.  The  location
and  properties  of  this  left  posterior  activation  suggest  it  is  involved  in
what  is  called  the  visual  word  form.  The  visual  word  form  is  a  representa¬ 
tion  of  the  orthography  of  the  letter  string  in  which  individual  letters  are 
combined  into  a  single  chunk. 

These  recent  PET  findings,  together  with  many  cognitive  studies,  imply 
that  letter  strings  are  represented  within  the  visual  system  both  as  unorga¬ 
nized  features  or  letters  and  within  a  unified  visual  word  form.  However, 
the  PET  images  used  in  these  studies  require  an  average  of  40  sec  and  could 
also  involve  feedback  from  more  anterior  to  posterior  areas.  For  example, 
the  occipital  visual  word  form  may  be  set  up  only  after  the  subject  accesses 
the  word  meaning. 

To  study  the  time  course  of  word  processing,  we  developed  tasks  that 
take  advantage  of  another  recent  PET  result  showing  that  when  subjects 
attend  to  color,  motion,  or  form,  appropriate  posterior  prestriate  areas 
show  increased  activation.  Attention  appears  to  amplify  the  activity  of
anatomical  areas  in  which  the  related  computations  occur  (Corbetta  et  al. 
1990).  We  considered  a  task  that  requires  the  subject  to  deal  with  the 
individual  features  of  a  letter  (Compton  et  al.  1991).  To  study  attention  to 
visual  features,  subjects  are  required  to  detect  a  line  thickening  within  one 
letter  of  a  four-  or  six-letter  word  or  consonant  string.  To  link  cognitive 
results  with  anatomy  we  studied  these  tasks  along  with  passive  perception 
while  recording  electrical  activity  from  32-64  electrodes  positioned  over 
occipital,  temporal,  parietal,  and  frontal  areas. 

The  results  of  this  study  provide  encouragement  for  the  effort  to  relate 
PET  anatomical  data  to  time  dynamic  ERP  data.  We  found  a  very  strong 
posterior  temporoparietal  asymmetry  in  which  electrical  activity  at  about
100  msec  is  larger  from  the  right  hemisphere  than  from  the  left.  This  effect 
is  quite  strong,  but  only  at  temporal  and  inferior  parietal  sites.  The  effect 
fits  with  the  idea  of  a  right  posterior  generator  related  to  visual  features 
because  it  occurs  within  the  first  100  msec  and  similarly  for  word  and 
nonword  strings  in  all  task  blocks. 

We  asked  subjects  to  respond  to  the  presence  of  a  thick  feature  by  pressing 
one  key  if  it  was  present  and  another  if  it  was  not.  There  was  no  difference 
between  the  strings  that  made  words  and  those  that  did  not.  On  trials  when 
a  target  was  present,  reaction  times  appeared  to  reflect  the  distance  of  the 
target  from  the  center  of  vision.  Moreover,  the  difference  in  reaction  time
between  four  and  six  letters  (the  slope  of  the  search  function)  was  about
the  same  whether  the  target  was  present  or  absent.  If  subjects  had  been
searching  serially  and  stopped  when  they  found  a  target,  the  slope  for  the
target  absent  trials  would  be  twice  that  for  the  target  present  trials,  since 
when  a  target  was  present  subjects  could  respond  as  soon  as  they  detected 
a  target,  which  on  the  average  would  require  searching  only  half  the  list. 
These  results  make  it  seem  reasonable  that  when  attending  to  features, 
subjects  search  a  representation  located  in  the  right  posterior  temporal  lobe. 
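
The 2:1 slope prediction that the observed data are tested against can be checked with a small sketch of a serial, self-terminating search. It assumes that one letter is inspected at a time, that each inspection costs the same amount of time, and that a target is equally likely at every position. The function names and the per-item cost are hypothetical illustrations, not quantities reported in the study.

```python
def expected_comparisons(string_length: int, target_present: bool) -> float:
    """Mean number of letters inspected under serial self-terminating search."""
    if target_present:
        # The target is assumed equally likely at positions 1..n, so the
        # search stops on average after (n + 1) / 2 letters.
        return (string_length + 1) / 2
    # With no target, every letter must be inspected before answering "absent".
    return float(string_length)


def search_slope(short_len: int, long_len: int, target_present: bool,
                 msec_per_item: float = 1.0) -> float:
    """Reaction-time increase per added letter (the slope of the search function)."""
    delta = (expected_comparisons(long_len, target_present)
             - expected_comparisons(short_len, target_present))
    return msec_per_item * delta / (long_len - short_len)


if __name__ == "__main__":
    present = search_slope(4, 6, target_present=True)
    absent = search_slope(4, 6, target_present=False)
    # Under this model the absent slope is twice the present slope.
    print(present, absent, absent / present)
```

Because the measured slopes were about the same for target-present and target-absent trials, the data do not match this serial self-terminating pattern.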

The  second  anatomical  area  relates  to  the  visual  word  form  system  of 
the  left  ventral  occipital  lobe.  To  study  the  difference  between  words  and 
consonant  strings  we  superimposed  their  event-related  potentials.  While
differences  between  the  two  stimulus  types  are  found  in  a  number  of  elec¬ 
trodes,  these  appear  to  occur  first  along  a  posterior  band  of  areas  extending 
from  the  right  posterior  temporal  lobe  to  the  left  anterior  temporal  lobe. 
These  findings  are  generally  consistent  with  the  location  of  the  PET  gener¬ 
ator  in  the  anterior  left  occipital  lobe  near  the  midline,  although  the  degree 
of  anatomical  localization  from  the  scalp  voltage  data  is  low  and  the  evi¬ 
dence  for  lateral  asymmetry  in  the  voltage  data  is  not  strong.  We  are  seeing 
differences  over  a  large  part  of  the  posterior  scalp  of  both  hemispheres.  It 
is  possible  to  gain  a  somewhat  better  notion  of  the  localization  of  these 
effects  if  the  data  are  transformed  by  use  of  an  average  reference  based  on 
all  of  the  electrodes  other  than  the  one  being  considered,  weighted  by  their 
distance  from  the  active  electrode  site.  This  transformation  suggests  a  left 
posterior  generator  in  the  neighborhood  of  the  posterior  temporal  lobe. 
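
One way to picture the re-referencing step is the following sketch, in which each electrode is re-expressed relative to a weighted mean of all the other electrodes. The text does not give the weighting function, so the inverse-distance weights used here are purely an assumption for illustration, as are the function name and the coordinate format. This is not the authors' actual procedure.

```python
import math


def weighted_average_reference(voltages, positions):
    """Re-reference one time sample of scalp voltages.

    voltages:  list of floats, one per electrode
    positions: list of (x, y, z) coordinates, one per electrode (all distinct)

    ASSUMPTION: each other electrode is weighted by the inverse of its
    distance from the electrode being considered. The source only says the
    reference is "weighted by distance"; the exact form is not specified.
    """
    rereferenced = []
    for i, (v_i, p_i) in enumerate(zip(voltages, positions)):
        weighted_sum = 0.0
        weight_total = 0.0
        for j, (v_j, p_j) in enumerate(zip(voltages, positions)):
            if j == i:
                continue  # the active electrode is excluded from its own reference
            weight = 1.0 / math.dist(p_i, p_j)  # hypothetical weighting choice
            weighted_sum += weight * v_j
            weight_total += weight
        rereferenced.append(v_i - weighted_sum / weight_total)
    return rereferenced


if __name__ == "__main__":
    positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    # A voltage shared by all electrodes is removed entirely by the reference.
    print(weighted_average_reference([1.0, 1.0, 1.0], positions))
```

Any reference of this kind removes activity common to all sites, so what remains emphasizes local differences between an electrode and its neighbors.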

If  the  ERP  effect  is  coming  from  the  visual  word  form  area,  our  findings 
suggest  this  area  is  making  the  initial  discrimination  between  words  and 
nonwords  starting  at  about  200  msec  after  input.  Since  the  PET  result 
is  averaged  over  40  sec  of  activity,  activation  in  posterior  locations  could 
have  been  fed  back  from  some  more  anterior  area.  However,  the  ERP  data 
clearly  suggest  that  in  the  passive  conditions,  the  posterior  discrimination 
is  being  made  first,  because  no  other  electrodes  show  this  difference  prior 
to  the  posterior  ones. 

In  PET  studies  an  area  of  the  left  frontal  lobe  is  active  when  subjects  deal
with  the  meaning  of  a  word.  When  the  process  is  extended  by  requiring 
association  of  several  words  to  a  given  input  or  by  slowing  the  rate  of 
presentation,  this  left  frontal  area  is  joined  by  activation  in  Wernicke's 
area  (Fiez  and  Petersen,  1993).  To  study  semantic  and  feature  activation 
together  we  used  tasks  that  clearly  involved  both.  The  feature  task  was 
again  looking  for  a  thick  letter;  the  semantic  task  required  the  subject  to 
determine  if  the  word  referred  to  a  natural  or  a  manufactured  item.  Our 
reasoning  was  that  in  the  first  task,  subjects  would  be  attending  to  the 
feature  level  and  in  the  semantic  task  to  the  meaning  of  the  word.  If,  as  has 
been  described  in  the  PET  work,  attention  serves  to  amplify  computations, 
it  should  be  possible  to  see  amplifications  of  the  voltages  in  the  waveforms 
in  the  right  posterior  area  in  feature  analysis  and  in  the  left  frontal  area  in 
semantics,  depending  on  the  task  used.  We  used  exactly  the  same  stimuli 
in  the  two  tasks. 

Results  showed  that  the  left  frontal  area  was  more  positive  at  about 
200-300  msec  when  the  task  was  semantic,  while  the  right  posterior  area 
showed  more  positivity  when  the  task  was  feature  search.  These  effects 
were  not  confined  to  single  electrode  sites.  For  the  posterior  area,  it  was 
possible  to  compare  the  electrode  sites  first  showing  the  greater  right  hemi¬ 
sphere  activation  at  100  msec  associated  with  the  visual  attribute,  with 
those  showing  the  amplification  due  to  attribute  search  at  about  250  msec. 
Our  comparison  generally  supported  the  idea  that  roughly  the  same  areas 
that  first  carried  out  the  visual  attribute  computations  on  the  letter  string 
were  reactivated  150  msec  later  when  subjects  were  looking  for  the  thick
letter.  This  fits  with  the  idea  that  subjects  can  voluntarily  reactivate  areas  of
the  brain  that  performed  the  task  automatically  when  they  are  instructed 
to  deal  with  that  computation  voluntarily.  The  semantic  effect  was  found 
at  several  frontal  sites  bilaterally.  This  differs  from  the  PET  data,  which 
were  strictly  left  lateralized,  although  there  was  some  evidence  that  the 
left  frontal  area  showed  the  effect  more  strongly  than  the  right. 

A  popular  idea  in  modern  physiology  is  called  reentrant  processing
(Edelman  and  Mountcastle  1978).  Basically,  this  is  the  idea  that  higher  level 
associations  are  made  by  fibers  that  reenter  the  brain  areas  that  processed 
the  initial  input.  Mountcastle  has  written  about  the  basic  organization  of 
cortical  anatomy  as  follows: 

It  is  well  known  from  classical  neuroanatomy  that  many  of  the  large  entities 
of  the  brain  are  interconnected  by  extrinsic  pathways  into  complex  systems, 
including  massive  reentrant  circuits. 

Simulations  based  on  the  coordination  of  widespread  neural  systems
also  rely  upon  this  principle,  as  described  by  Sporns  et  al.  (1989): 

Signaling  between  neuronal  groups  occurs  via  excitatory  connections  that 
link  cortical  areas,  usually  in  a  reciprocal  fashion.  According  to  the  theory 
of  neural  group  selection,  selective  dynamical  links  are  formed  between 
distant  neural  groups  via  reciprocal  connections  in  a  process  called  reentry. 
Reentrant  signaling  establishes  correlations  between  cortical  maps,  within 
or  between  different  levels  of  the  nervous  systems. 

Reentrant  processing  may  be  contrasted  with  more  traditional  notions 
that  higher  functions  are  confined  to  higher  associational  areas  of  the  brain. 
A  similar  viewpoint  to  reentrant  processing  is  expressed  by  Damasio  and 
Damasio  (chapter  3).  In  our  studies,  the  visual  computation  occurs  at  100 
msec,  followed  by  a  semantic  computation  which  might  be  complete  by 
200-300  msec.  When  the  instruction  is  to  search  the  string  for  a  feature,  the 
electrodes  around  the  area  originally  performing  the  visual  computation 
are  reactivated.  Similarly,  when  asked  to  make  a  semantic  computation, 
the  area  thought  to  perform  such  computations  was  amplified  in  electrical 
activity  about  100  msec  after  its  initial  computation. 

If  the  brain  operates  in  this  way  we  might  then  be  able  to  instruct  the 
subject  to  compute  the  same  functions  in  different  orders  and  thus  repro¬ 
gram  the  order  of  the  underlying  computations.  To  investigate  this,  we 
(Grossenbacher  et  al.  1991)  defined  what  we  call  a  conjunction  task.  We 
ask  the  subjects  to  respond  with  a  key  if  a  word  refers  to  an  object  that  is 
manufactured  (e.g.,  paper)  and  has  a  thick  letter  and  otherwise  to  respond 
with  a  second  nontarget  key.  On  one  day  we  have  the  subjects  perform 
the  thick  letter  task  and  then  ask  them  to  respond  with  the  target  key  if 
the  stimulus  has  a  thick  letter  and  refers  to  a  manufactured  object.  On  an¬ 
other  day  we  have  subjects  perform  the  semantic  task  and  then  ask  them 
to  respond  with  the  target  key  if  the  word  is  manufactured  and  has  a  thick 
letter.  The  function  to  be  computed  is  thus  exactly  the  same.  The  inputs 
are  identical  and  the  responses  (if  correct)  are  identical,  but  the  order  of
the  underlying  computations  is  reversed. 

We  did  not  expect  the  subjects  to  actually  compute  the  functions  in  a 
serial  fashion.  Our  hope  was  only  that  they  would  emphasize  the  priority 
computation  and  perhaps  complete  it  somewhat  earlier  than  the  second 
nonpriority  computation.  To  see  if  this  happens  one  can  look  at  the  reaction 
time  data  from  this  task.  We  examined  the  nontarget  reaction  times  where 
subjects  can  quit  when  they  find  either  computation  inappropriate  for  a 
target.  In  general  the  thick  letter  task  is  somewhat  faster;  subjects  quit 
sooner  when  there  is  no  thick  letter  present.  They  are  also  relatively  faster  if 
the  thick  letter  task  has  been  given  priority  by  training.  On  the  other  hand, 
if  a  thick  letter  is  present  and  responses  must  be  based  on  semantic  analysis, 
subjects  are  faster  if  the  semantics  was  given  priority  by  training.  Thus  the 
reaction  time  data  suggest  that  we  have  been  successful  in  reordering  the 
computation  times. 
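
The logic of the conjunction task can be illustrated with a deliberately simplified sketch. It treats the two computations as strictly serial and self-terminating, which the authors explicitly did not expect of real subjects, so it is only a caricature of the priority manipulation. The function name, the priority labels, and the count of checks as a stand-in for reaction time are all hypothetical.

```python
def conjunction_response(is_manufactured: bool, has_thick_letter: bool, priority: str):
    """Return (response, checks_performed) for one stimulus.

    priority is "semantic" or "feature": the computation trained first and
    therefore, in this caricature, evaluated first. The response is the same
    for either order; only the number of checks before quitting differs.
    """
    checks = {"semantic": is_manufactured, "feature": has_thick_letter}
    if priority not in checks:
        raise ValueError("priority must be 'semantic' or 'feature'")
    order = [priority] + [name for name in checks if name != priority]
    performed = 0
    for name in order:
        performed += 1
        if not checks[name]:
            # Either failed check is enough to reject the stimulus as a target.
            return "nontarget", performed
    return "target", performed


if __name__ == "__main__":
    # Manufactured word with no thick letter: rejected sooner under feature priority.
    print(conjunction_response(True, False, "feature"))
    print(conjunction_response(True, False, "semantic"))
```

In this sketch a nontarget is rejected after one check when the failing computation has priority and after two checks otherwise, which mirrors the direction of the nontarget reaction-time differences described above.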

We  can  now  look  at  images  of  the  underlying  brain  activity  recorded 
from  above  the  left  frontal  or  right  posterior  areas.  The  two  forms  of  the 
conjunction  task  differ  at  about  300  msec  following  input.  In  the  left  frontal 
area,  the  semantic  priority  task  returns  to  baseline  first;  later,  the  physical 
priority  task  returns  to  baseline.  A  reversed  effect  is  found  in  the  posterior 
area.  This  time  the  physical  priority  task  returns  first  to  baseline  followed 
by  the  semantic  priority  task.  The  two  brain  areas  seem  to  reflect  the 
relative  priority  emphasized  in  the  directions.  We  believe  that  the  subject 
has  used  attention  to  program  the  relative  order  of  the  two  computations 
represented  by  the  two  anatomical  areas. 

These  results  suggest  that  the  person  is  able  to  reorder  the  priority  of 
the  underlying  computations  in  the  conjunction  task.  They  also  provide 
us  with  a  basis  for  understanding  how  the  brain  can  carry  out  so  many 
different  tasks  on  visual  input.  Aspects  of  the  underlying  computations  do 
not  seem  affected  by  the  instructions.  The  visual  attribute  area  of  the  right 
posterior  brain  seems  to  carry  out  the  computation  on  the  input  string  at  100 
msec  irrespective  of  whether  the  person  is  concerned  with  visual  features 
as  a  part  of  the  task  or  not.  However,  when  the  task  is  identified  as  looking 
for  a  thick  letter  these  same  brain  areas  are  reactivated  and  presumably 
carry  out  the  additional  computations  necessary  to  make  sure  that  one  of 
the  letters  has  just  enough  thickening  to  constitute  a  target.  Attention  thus 
can  amplify  computations  within  particular  areas,  but  often  does  so  by 
reentering  the  area,  not  by  amplifying  its  initial  activation. 

Models  of  Control 

The  data  on  reentrant  processing  imply  two  important  ideas  related  to  the 
attentional  control  of  information  processing.  First,  the  results  of  atten- 
tional  control  are  widely  distributed,  resulting  in  amplification  of  activity 
in  the  anatomical  areas  that  originally  computed  that  information.  Sec¬ 
ond,  the  source  of  this  attentional  control  need  not  involve  a  system  that 
has access to the information being amplified, but can be a system that
has  connections  to  places  where  the  computations  occur.  This  sense  of 
control  by  a  separate  attentional  system  over  widely  distributed  computa¬ 
tions  is  suggested  physiologically  by  Van  Essen,  Anderson,  and  Olshausen 
(chapter  13)  and  computationally  by  Ullman  (chapter  12). 

As  the  result  of  activity  within  the  attention  network,  the  relevant  brain 
areas  will  be  amplified  and/or  irrelevant  ones  inhibited,  leaving  the  brain 
to  be  dominated  by  the  selected  computations.  If  this  were  the  correct  the¬ 
ory  of  attentional  control,  one  would  expect  to  find  the  source  of  attention 
to  lie  in  systems  widely  connected  to  other  brain  areas,  but  not  otherwise 
unique  in  structure.  As  pointed  out  by  Goldman-Rakic  (1988),  this  appears 
to  be  the  basic  organization  of  frontal  networks.  Anterior  cingulate  connec¬ 
tions  to  limbic,  thalamic,  and  basal  ganglia  pathways  would  distribute  its 
activity  to  the  widely  dispersed  connections  we  have  seen  to  be  involved 
in  cognitive  computations. 

To  illustrate  this  framework  for  attention,  we  use  a  recent  model  of 
control  of  covert  visual  attention  developed  at  our  center  by  Jackson  and 
Houghton  (1992).  The  model  involves  location  expectations  held  by  the 
anterior  attention  network  interacting  with  location  cues  that  influence  the 
posterior  attention  network.  The  basic  architecture  of  the  model  is  shown 
in  figure  9.3. 

The  posterior  attention  network  (including  the  parietal  lobe  and  asso¬ 
ciated  thalamic  and  midbrain  areas)  and  the  anterior  attention  network 
(including  the  anterior  cingulate)  influence  each  other  via  direct  cortical 
projections,  but  also  indirectly  through  a  comparator  operation  involving 
the  basal  ganglia.  The  direct  loops  have  the  effect  of  allowing  activations 
at  common  locations  in  the  two  systems  to  support  one  another.  A  sensory 
event  facilitates  processing  at  a  location  due  to  activation  of  the  poste¬ 
rior  network,  but  an  expectation  also  operates  via  the  anterior  network  to 
facilitate  the  expected  location. 

The role of the basal ganglia loops is more complex. A direct pathway
between the anterior cingulate and striatum serves as a reverberating
circuit  to  maintain  expected  locations  and  to  amplify  them  when  their  lo¬ 
cations  match.  The  indirect  pathway  operates  when  there  is  a  mismatch 
between  the  two  attention  systems  to  dampen  down  activation  within  the 
anterior  system  at  any  locations  for  which  there  is  no  input  from  the  pos¬ 
terior  attention  system.  This  allows  expectations  to  be  overcome. 
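
The match and mismatch logic of the two basal ganglia pathways can be illustrated with a toy update rule. The sketch below is not the Jackson and Houghton (1992) implementation; the function name and the gain and damping values are assumptions chosen only to show how an expectation that receives no support from the posterior system can be overcome.

```python
def update_expectation(anterior, posterior, gain=0.5, damping=0.5):
    """One update step of a toy match/mismatch comparator.

    anterior  -- expectation strength at each location (anterior network)
    posterior -- cue-driven activation at each location (posterior network)
    gain and damping are hypothetical parameters, not values from the model.
    """
    updated = []
    for expected, cued in zip(anterior, posterior):
        if cued > 0.0:
            # Match (direct pathway): amplify the expectation, capped at 1.
            updated.append(min(1.0, expected + gain * cued))
        else:
            # Mismatch (indirect pathway): dampen an unsupported expectation.
            updated.append(expected * (1.0 - damping))
    return updated


# The subject expects a target at location 0, but the cue appears at location 2.
anterior = [0.8, 0.0, 0.1]
posterior = [0.0, 0.0, 1.0]
for _ in range(3):
    anterior = update_expectation(anterior, posterior)
```

After a few steps the initially strong expectation at location 0 has decayed, while the cued location 2 dominates the anterior map.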

The  resulting  system  is  rather  complicated  but  it  is  constrained  by  the 
anatomical structures of the relevant components. It makes predictions of
performance  in  cognitive  experiments  using  cues  and  targets.  For  example, 
the  model  can  predict  some  reaction  time  results  accumulated  from  cogni¬ 
tive  experiments  involving  manipulations  such  as  lesions  of  the  posterior 
system,  blocking  of  NE  or  DA  into  the  system,  competition  from  dual  tasks, 
etc.  We  do  not  wish  to  present  the  model  as  a  final  answer  to  the  coordina¬ 
tion  of  attentional  systems,  but  merely  to  show  that  the  logic  of  the  opera¬ 
tions  we  have  discussed  can  be  embodied  in  a  functioning  computer  model. 

Figure 9.3 Architecture indicating the role of attention networks in covert orienting to visual
locations. The top of the figure indicates the areas of the posterior attention network (e.g.,
parietal cortex) and anterior attention network (e.g., anterior cingulate). The lower parts of
the figure indicate areas of the basal ganglia and their connections. The open arrows represent
excitatory and the closed arrows inhibitory connections.

The type of attentional control of orienting suggested above can thus
lead  to  specific  simulations  allowing  tests  at  the  cognitive  level  in  consid¬ 
erable  detail.  Recent  PET  data  (Corbetta  et  al.  1993)  show  that  involuntary 
shifts  of  attention  to  visual  locations  induced  by  cues  produce  strong  supe¬ 
rior  parietal  activation  within  the  posterior  system.  When  a  subject  must 
shift  attention  endogenously  to  report  a  target,  this  activation  is  accompa¬ 
nied  by  frontal  and  anterior  cingulate  activation,  as  would  be  expected  if 
endogenous  shifts  are  controlled  from  frontal  areas. 

Some  recent  developmental  findings  (Posner  and  Rothbart  1992)  show 
that  the  distributed  connections  by  which  the  attention  systems  assume 
control  over  various  functions  may  develop  over  a  considerable  period  of 
life.  The  control  of  orienting  by  the  posterior  attention  system  appears  to 
develop mainly between 4 and 6 months. During this period the infant develops
the ability to disengage from visual stimuli and to control the areas
of the visual field to which it will attend. An important goal of early
infant  development  is  the  control  of  pain  and  distress.  PET  studies  suggest 
that  the  highest  representation  of  pain  is  within  the  anterior  cingulate  (Tal¬ 
bot  et  al.  1991).  This  finding  suggests  that  attentional  manipulations  might 
be  important  in  the  control  of  distress.  We  have  shown  the  evidence  of  such 
control  at  about  three  months.  Orienting  to  visual  events  can  be  employed 
to  quiet  or  calm  negative  vocalizations.  However,  the  distress  appears  to 
be  maintained  and  reappears  when  the  infant's  attention  to  the  stimulus  is 
reduced.  Caregivers  also  report  the  use  of  visual  orienting  to  block  overt 
manifestations  of  distress  at  about  this  age.  It  appears  that  the  attention 
system continues to develop later in the first year of life and beyond. One
of the hallmarks of the higher level attentional system involving the anterior
cingulate is its involvement in tasks that involve conflict between signals,
such as the Stroop effect (Pardo et al. 1990). At 9 to 12 months one begins to
see  how  control  of  reaching  behavior  allows  the  infant  to  reach  separately 
from  the  line  of  regard  (Diamond  1988).  Control  of  language  behavior  by 
the  attentional  system  occurs  even  later.  Studies  of  development  provide 
another  method  for  observing  attention  and  for  testing  models. 

PRINCIPLES  CONNECTING  COGNITIVE  AND  NEURAL  SYSTEMS 

It  is  often  difficult  to  grasp  the  principles  that  arise  out  of  various  exper¬ 
imental  demonstrations  and  modeling  efforts,  particularly  when  they  are 
presented  in  abbreviated  form  such  as  in  this  chapter.  Below  we  attempt 
to  summarize  some  of  the  general  ideas  related  to  our  framework  that 
appear  to  arise  from  our  review.  Each  of  these  principles  seeks  to  con¬ 
nect  mental  experience  to  neural  areas  via  the  methods  we  have  described 
above. 

1.  Elementary  mental  operations  are  localized  in  discrete  neural  areas.  Evidence 
supporting  this  point  rests  both  upon  work  in  attention  and  in  language. 
Operations  involved  in  selective  attention  discussed  in  this  chapter  are 
carried out by diverse networks. The study of areas active during auditory
and visual word processing leads us to a similar conclusion (Petersen et
al.  1989).  In  addition,  we  view  current  discussions  of  motor  control,  object 
and  face  recognition,  and  memory  as  providing  additional  support  for  this 
general  idea. 

2. Cognitive tasks are performed by a network of widely distributed neural systems.
We have illustrated this idea in this chapter by showing that attention
involves several networks of cortical and subcortical areas. Studies of
visual  and  auditory  word  processing  also  suggest  networks  of  anatomical 
areas  are  involved  even  in  very  simple  word-association  tasks  (Petersen  et 
al.  1989). 

3. Computations in a network interact by means of "reentrant" processes. Cognitive
experiments give good evidence that the successful ordering of com¬
putations  is  necessary  for  performance.  Ordering  does  not  take  place  by 
a  strict  serial  organization.  Instead,  computations  appear  to  pass  infor¬ 
mation  back  and  forth  to  coordinate  their  results.  While  it  has  been  clear 
that  precise  connections  exist  between  anatomically  distant  areas,  it  ap¬ 
pears  that  a  particular  anatomical  area  is  active  whenever  its  computation 
is  required.  Since  computations  are  often  contingent  on  information  from 
another  area,  this  can  take  place  only  if  that  information  is  fed  back  to 
reenter  the  critical  areas. 

4.  Hierarchical  control  is  a  property  of  network  operation.  Discovery  of  a  sepa¬ 
rate  network  of  anatomical  areas  devoted  to  attention  has  been  described. 
This  view  provides  a  basis  for  establishing  executive  control  over  widely 
distributed  networks.  The  requirement  for  such  control  systems  is  clear 
from  cognitive  experiments  showing  interference  between  simultaneous 
performance  of  cognitive  tasks  irrespective  of  the  nature  of  their  constituent 
computations.  Moreover,  the  control  appeared  to  be  largely  inhibitory. 
That  is,  attention  to  one  concept  reduces  the  probability  that  other  con¬ 
cepts  receive  attention.  Selection  between  simultaneously  operating  rep¬ 
resentations  appears  to  require  inhibition  of  one  of  them.  These  findings 
supported  the  idea  of  executive  control  by  attention  systems. 

5.  Activation  of  a  computation  produces  a  temporary  reduction  in  the  threshold 
for  its  reactivation.  This  principle  underlies  the  cognitive  phenomenon  of 
priming.  Whenever  a  code  has  been  active  it  becomes  easier  for  a  stimulus 
to  reactivate  it.  For  the  processing  of  words,  priming  exists  at  the  level  of 
attributes,  word  forms,  phonology,  and  semantics. 

6.  When  a  computation  is  repeated  its  reduced  threshold  is  accompanied  by  reduced 
effort  and  less  attention.  This  principle  may  seem  the  inverse  of  the  last,  but  in 
fact  it  is  a  corollary.  The  repetition  of  a  computation  improves  its  efficiency; 
as  a  result,  the  overall  activity  accompanying  the  computation  is  reduced. 
Blood  flow  is  less,  electrical  activity  reduced,  and  the  interference  between 
the  repeated  computation  and  other  activity  is  reduced.  Thus  a  habituated 
activity  will  produce  less  orienting  of  attention,  the  memory  of  it  having 
been performed will be reduced, and other signs that the activity has been
automated  will  be  found. 

7.  Activating  a  computation  from  sensory  input  (bottom-up)  and  from  attention 
(top-down)  involves  many  of  the  same  neurons.  Attention  to  motion,  color,  or 
form  activates  many  of  the  same  prestriate  areas  that  were  active  when 
passively  receiving  information  of  the  same  type.  There  is  some  evidence 
that  the  size  and/or  number  of  prestriate  areas  active  in  attention  condi¬ 
tions  are  greater  than  during  the  comparable  passive  perception  condition. 
The  same  principle  was  discussed  above  from  recording  scalp  electrical  ac¬ 
tivity  during  word  processing.  We  believe  attention  can  be  used  to  mark 
the  activity  underlying  a  particular  computation. 

8.  Practice  in  the  performance  of  any  computation  will  decrease  the  neural  net¬ 
works  necessary  to  perform  it.  The  idea  that  repetition  of  events  leads  even¬ 
tually  to  their  automation,  that  is,  to  performance  without  attention,  is 
well  established  in  psychology  (Posner  1978).  Recent  PET  data  suggest 
that  repetition  of  the  same  performance  leads  to  reduced  blood  flow  in  the 
neural  areas  that  are  originally  required  to  generate  the  response  (Fiez  and 
Petersen,  1993).  We  believe  that  this  principle,  like  others  in  this  section, 
will  apply  to  cognitive  tasks  in  general. 

INTRODUCTION

The  ability  to  record  from  individual  neurons  in  the  central  nervous  system 
of  animals  while  these  engage  in  perceptual  tasks,  memorize,  or  perform 
a  motor  response  has  revealed  numerous  and  fascinating  correlations  be¬ 
tween  the  activity  of  individual  nerve  cells  and  complex  behavioral  pat¬ 
terns.  Cells  have  been  found  whose  responses  distinguish  between  familiar 
and  unfamiliar  objects  (Baylis  and  Rolls  1987;  Miyashita  and  Chang  1988; 
Fuster  1990;  Miller  et  al.  1991a,b),  are  selective  for  particular  aspects  of  faces 
(for  review  see  Rolls  1991;  Gross  1992),  reflect  precisely  the  location  of  a 
remembered  target  (Goldman-Rakic  et  al.  1990),  or  predict  with  accuracy 
the  direction  of  an  eye  movement  (Goldberg  et  al.  1990;  Wurtz  et  al.  1990). 
Cells  have  been  described  in  the  visual  cortex  whose  response  thresholds 
correspond  precisely  to  the  behavioral  thresholds  of  the  animal  (Parker 
and  Hawken  1985;  Newsome  et  al.  1990).  Finally,  activating  a  local  clus¬ 
ter  of  neurons  by  microstimulation  in  a  visual  area  specialized  for  motion 
processing  biases  the  perception  of  motion  as  if  additional  moving  targets 
were added to the visual stimulus (Salzman et al. 1992). Results of this kind
strongly support the idea that the activation of individual neurons can represent a
code for highly complex functions, a notion that is commonly referred to as
the "single neuron doctrine" (Barlow 1972). However, there have always
also  been  speculations  that  additional  coding  principles  might  be  realized. 
Most  of  these  begin  with  Donald  Hebb's  proposal  that  representations  of 
sensory  or  motor  patterns  should  consist  of  assemblies  of  cooperatively 
interacting  neurons  rather  than  of  individual  cells.  This  coding  principle 
implies  that  information  is  contained  not  only  in  the  activation  level  of  in¬ 
dividual  neurons  but  also,  and  actually  to  a  crucial  extent,  in  the  relations 
between  the  activities  of  distributed  neurons.  If  true,  the  description  of  a 
particular  neuronal  state  would  have  to  take  into  account  not  only  the  rate 
and  the  specificity  of  individual  neuronal  responses  but  also  the  relations 
between  discharges  of  distributed  neurons.  Over  the  last  decade,  these 
speculations  have  received  some  support  both  from  experimental  results 
and  theoretical  considerations.  Search  for  individual  neurons  responding 
with the required selectivity to individual objects was only partly successful
and has so far revealed specificity only for faces and for a limited set
of  objects  with  which  the  animal  had  been  familiarized  extensively  before 
(see  below).  And  even  in  these  cases  it  is  likely  that  a  particular  face  or 
object  evokes  responses  in  a  very  large  number  of  neurons.  Recordings 
from  motor  centers  such  as  the  deep  layers  of  the  tectum  and  areas  of  the 
motor  cortex  provided  no  evidence  for  command  neurons  such  as  exist  in 
simple  nervous  systems  and  code  for  specific  motor  patterns.  Rather,  these 
studies  provided  strong  support  for  a  population  code  as  the  trajectory  of  a 
particular  movement  could  be  predicted  correctly  only  if  the  relative  con¬ 
tributions  of  a  large  number  of  neurons  were  considered  (Georgopoulos 
1990;  Mussa-Ivaldi  et  al.  1990;  Sparks  et  al.  1990).  Arguments  favoring  the 
possibility  of  relational  codes  have  also  been  derived  from  the  growing  evi¬ 
dence  that  cortical  processes  are  highly  distributed  (Zeki  1973;  Ungerleider 
and  Mishkin  1982;  Maunsell  and  Newsome  1987;  Desimone  and  Ungerlei¬ 
der  1989;  Newsome  et  al.  1990;  Felleman  and  Van  Essen  1991;  Zeki  et  al. 
1991;  Goodale  and  Milner  1992).  Further  indications  for  the  putative  signif¬ 
icance  of  relational  codes  are  provided  by  theoretical  studies  that  attempted 
to  simulate  certain  aspects  of  pattern  recognition  and  motor  control  in  ar¬ 
tificial  neuronal  networks.  Single  cell  codes  were  found  appropriate  for 
the  representation  of  a  limited  set  of  well-defined  patterns  but  the  number 
of  required  representational  elements  scaled  very  unfavorably  with  the 
number  of  representable  patterns.  Moreover,  severe  difficulties  were  en¬ 
countered  with  functions  such  as  figure-ground  distinction  because  single 
cell  codes  turned  out  to  be  too  rigid  and  inflexible,  again  leading  to  a  combi¬ 
natorial  explosion  of  the  required  representational  units.  By  implementing 
population  or  relational  codes  some  of  these  problems  could  be  solved  or 
at least alleviated. Exploiting relational codes also opens up the possibility
of using time as an additional coding space. By defining a narrow temporal
window  for  the  evaluation  of  coincident  firing  and  by  temporal  pattern¬ 
ing  of  individual  neuronal  responses,  relations  between  the  activities  of 
spatially  distributed  neurons  can  be  defined  very  selectively  (Milner  1974; 
von  der  Malsburg  1985).  If  such  temporal  coding  is  added  to  the  principle 
of  population  coding  the  number  of  different  patterns  or  representations 
that  can  be  generated  by  a  given  set  of  neurons  increases  substantially. 
Moreover,  it  has  been  demonstrated  that  perceptual  functions  like  scene 
segmentation  and  figure-ground  distinction  that  require  flexible  associa¬ 
tion  of  features  can  in  principle  be  solved  if  one  relies  on  relational  codes 
in  which  the  relatedness  of  distributed  neurons  is  expressed  by  the  tempo¬ 
rary  synchronization  of  their  respective  discharges  (Milner  1974;  von  der 
Malsburg  1985;  von  der  Malsburg  and  Schneider  1986;  Shimizu  et  al.  1986). 

Arguments  emphasizing  the  importance  of  temporal  relations  between 
the  discharges  of  cortical  neurons  have  also  been  derived  from  recent  data 
on connectivity and synaptic efficacy. Cortical cells receive many thousand
synaptic inputs, but on average a particular cell contacts any one of its
target cells with only one synapse (Braitenberg and Schüz 1991). In vitro
studies from cortical slices indicate that the efficacy of individual synapses
is low and that not every presynaptic action potential triggers the release of
transmitter  (Stevens  1987;  see  chapter  11  by  Stevens).  Thus,  many  presy¬ 
naptic  afferents  need  to  be  activated  simultaneously  to  drive  a  particular 
cell  above  threshold  and  to  ensure  reliable  transmission.  Even  more  co- 
operativity  is  required  for  the  induction  of  synaptic  modifications  such 
as  long-term  potentiation  and  long-term  depression.  These  modifications 
have high thresholds and require substantial and prolonged postsynaptic
activation (Artola and Singer 1987, 1990). Temporal coordination of
cortical responses thus appears necessary both for successful transmission
across  successive  processing  stages  and  for  the  induction  of  use-dependent 
synaptic  modifications. 
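
The need for cooperativity can be made concrete with a simple binomial calculation. The sketch below assumes that synapses release transmitter independently, that all active afferents arrive within the cell's integration window, and that a fixed number of releases is needed to reach threshold. The release probability and threshold used here are hypothetical numbers, not measurements.

```python
from math import comb


def firing_probability(n_active, release_prob, inputs_needed):
    """Probability that at least `inputs_needed` of `n_active` synchronously
    active synapses release transmitter, each independently with
    probability `release_prob` (a binomial tail)."""
    return sum(
        comb(n_active, k) * release_prob**k * (1.0 - release_prob) ** (n_active - k)
        for k in range(inputs_needed, n_active + 1)
    )


# Hypothetical numbers: each synapse releases with probability 0.3 and the
# cell needs 20 near-simultaneous releases to cross threshold.
for n_active in (20, 50, 100, 200):
    p = firing_probability(n_active, release_prob=0.3, inputs_needed=20)
    print(f"{n_active:4d} synchronous afferents -> P(fire) = {p:.4f}")
```

Under these assumptions, transmission is essentially impossible when only as many afferents are active as releases are required, and becomes reliable only when a much larger group fires together.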

Despite  these  numerous  arguments  supporting  the  putative  importance 
of  temporal  relations  among  distributed  neuronal  responses  in  the  neo¬ 
cortex,  systematic  search  for  temporal  relations  among  the  activities  of 
simultaneously  recorded  cortical  neurons  is  still  at  an  early  stage.  Initially, 
cross-correlation  analysis  of  multielectrode  recordings  has  been  used  pri¬ 
marily  as  a  tool  of  functional  anatomy  to  reveal  excitatory  and  inhibitory 
connections  among  neurons.  Hence,  analysis  was  often  confined  to  spon¬ 
taneous  activity.  If  cells  were  activated  with  sensory  stimuli  this  was  done 
to  increase  activity  and  to  reduce  the  duration  of  the  measurements,  but 
not with the goal of disclosing dynamic, stimulus-related interactions. It
is  only  recently  that  cross-correlation  analysis  has  been  extended  to  re¬ 
sponses  evoked  by  selected  stimulus  configurations  to  test  whether  the 
responses  of  spatially  distributed  cortical  neurons  exhibit  temporal  rela¬ 
tions  that  are  sufficiently  consistent  to  serve  a  functional  role  in  cortical 
processing.  Many  of  these  latter  experiments  were  designed  to  test  specific 
predictions  derived  from  recent  theories  on  population  coding.  Therefore, 
the  review  of  these  cross-correlation  studies  will  be  preceded  by  a  brief  de¬ 
scription  of  the  conceptual  background.  As  most  of  the  theoretical  models 
have  dealt  with  problems  of  visual  pattern  processing  and  recognition  and 
as  most  of  the  experimental  studies  have  been  performed  in  the  visual  sys¬ 
tem,  the  conceptual  background  will  be  illustrated  mainly  on  the  basis  of 
visual  processes. 
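
For readers unfamiliar with the method, the core of a cross-correlation analysis can be sketched in a few lines. The function below simply counts how often a spike of one cell follows or precedes a spike of another cell at each time lag, using 1 msec bins. The function name and the spike trains are made up for illustration, and a real analysis would also normalize for firing rate and control for coincidences that are merely locked to the stimulus.

```python
def cross_correlogram(spikes_a, spikes_b, max_lag_ms=50):
    """Count spike pairs (a, b) by time lag t_b - t_a in 1 ms bins.

    spikes_a, spikes_b -- spike times in milliseconds
    Returns a dict mapping each lag (ms) to a coincidence count.
    """
    counts = {lag: 0 for lag in range(-max_lag_ms, max_lag_ms + 1)}
    for t_a in spikes_a:
        for t_b in spikes_b:
            lag = round(t_b - t_a)
            if -max_lag_ms <= lag <= max_lag_ms:
                counts[lag] += 1
    return counts


# Two made-up spike trains: cell B tends to fire 2 ms after cell A.
cell_a = [10, 35, 60, 85, 110]
cell_b = [12, 37, 62, 87, 112]
correlogram = cross_correlogram(cell_a, cell_b)
peak_lag = max(correlogram, key=correlogram.get)
```

A narrow peak near zero lag indicates that the two cells tend to fire together; here the peak falls at the lag built into the example trains.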

REPRESENTATIONS  AND  THE  BINDING  PROBLEM 

Most  perceptual  objects  can  be  decomposed  into  components  and  in  gen¬ 
eral  the  features  of  these  components  are  not  unique  for  a  particular  object. 
The  individuality  of  objects  results  from  the  specific  composition  of  ele¬ 
mentary  features  and  their  relations  rather  than  from  the  specificity  of  the 
component  features.  Hence,  for  a  versatile  representation  of  sensory  pat¬ 
terns  in  the  nervous  system  three  basic  functions  have  to  be  accomplished: 
(1)  elementary  features  need  to  be  represented  by  neuronal  responses,  (2) 
responses  to  features  constituting  a  particular  object  have  to  be  distin¬ 
guished  and  bound  together  in  a  flexible  way,  and  (3)  the  specific  relations 
among  these  features  have  to  be  encoded  and  preserved. 

One way to achieve the grouping of features and to establish an unam¬
biguous  code  for  their  specific  relations  is  to  connect  the  set  of  neurons  that 
responds  to  the  component  features  of  a  particular  object  to  a  higher  order 
neuron  that  will  represent  the  object.  If  the  thresholds  of  these  higher  or¬ 
der  neurons  are  adjusted  so  that  each  cell  responds  only  to  one  particular 
combination  of  feature  detectors,  the  responses  of  these  higher  order  neu¬ 
rons  would  provide  an  unambiguous  description  of  the  relations  between 
the  component  features  and  hence  would  be  equivalent  to  the  represen¬ 
tation  of  the  pattern.  In  this  scheme  the  features  of  the  object  are  bound 
together  by  convergence  of  fixed  connections  that  link  neurons  represent¬ 
ing  component  features  with  neurons  representing  the  whole  pattern.  The 
relations  between  features  are  encoded  by  the  specific  architecture  of  these 
convergent  connections. 
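
A minimal sketch of this scheme, assuming a handful of named feature detectors and a simple threshold unit, is given below. The feature names and the use of inhibitory weights on nonpreferred detectors are illustrative assumptions, not part of any specific published model.

```python
FEATURES = ["red", "green", "vertical", "horizontal"]


def object_unit(preferred):
    """Build a higher order unit wired to a fixed set of feature detectors.

    The threshold equals the number of preferred features, so the unit
    fires only when all of them are active together. Inhibitory weights
    on the remaining detectors (an assumption of this sketch) keep it
    silent for other combinations.
    """
    weights = {f: (1 if f in preferred else -1) for f in FEATURES}
    threshold = len(preferred)

    def respond(active_features):
        drive = sum(weights[f] for f in active_features)
        return drive >= threshold

    return respond


red_vertical = object_unit({"red", "vertical"})
green_horizontal = object_unit({"green", "horizontal"})
```

Each unit responds to exactly one conjunction, so every additional object to be represented requires another dedicated unit with its own fixed wiring.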

However,  not  all  of  the  predictions  following  from  this  latter  assump¬ 
tion  are  supported  by  experimental  evidence.  First,  while  cells  occupying 
higher  levels  in  the  processing  hierarchy  tend  to  be  selective  for  more  com¬ 
plex  constellations  of  features  than  cells  at  lower  levels,  many  continue 
to  respond  to  rather  simple  patterns  such  as  edges,  gratings,  and  simple 
geometric  shapes  (Tanaka  et  al.  1991;  Gallant  et  al.  1993).  Second,  apart 
from  cells  responding  preferentially  to  aspects  of  faces  and  hands  (Gross 
et  al.  1972;  Desimone  et  al.  1984, 1985;  Baylis  et  al.  1985;  Rolls  1991)  it  has 
been  notoriously  difficult  to  find  other  object-specific  cells  except  in  cases 
where  animals  had  been  familiarized  with  a  limited  set  of  objects  during 
extensive  training  (Miyashita  1988;  Sakai  and  Miyashita  1991).  Third,  no 
single  area  in  the  visual  processing  stream  has  yet  been  identified  that 
could  serve  as  the  ultimate  site  of  convergence  and  that  would  be  large 
enough  to  accommodate  the  vast  number  of  neurons  that  are  required  if 
all  distinguishable  objects  including  their  many  different  views  were  rep¬ 
resented  by  individual  neurons.  Finally,  the  point  has  been  made  that 
"binding  by  convergence"  may  not  be  flexible  enough  to  account  for  the 
rapid  formation  of  representations  of  new  patterns.  To  allow  for  the  rep¬ 
resentation  of  new,  hitherto  unknown  objects  one  would  have  to  postu¬ 
late  a  large  reservoir  of  uncommitted  cells.  These  neurons  would  have 
to  maintain  latent  input  connections  from  all  feature-selective  neurons  at 
lower  processing  stages  and  subsets  of  these  connections  would  have  to 
be  selected  and  consolidated  instantaneously  when  a  new  representation 
is  established. 

Similar  combinatorial  problems  arise  in  the  case  of  motor  control  but  they 
have  received  less  theoretical  consideration.  Here,  the  solution  equivalent 
to  "binding  by  convergence"  is  that  individual  command  neurons  at  the 
top of a hierarchically organized motor system each trigger one complex
motor  act.  Their  activity  would  have  to  become  distributed  through  di¬ 
vergent  and  highly  selective  connections  to  subsets  of  effector  neurons 
that  eventually  activate  particular  muscle  groups.  This  coding  concept 
encounters  the  same  problem  as  its  homologous  concept  on  the  sensory 
side: First, no such command neurons were found in areas that could perhaps
be regarded as being on the top of the processing hierarchy such as
the  supplementary  motor  field  or  prefrontal  motor  areas.  Second,  given 
the  sparseness  of  connections  between  individual  cortical  cells  (see  above) 
it  is  hard  to  see  how  activation  of  only  a  few  neurons  could  give  rise  to 
the  mass  action  required  for  the  execution  of  a  movement.  Third,  one 
would  again  have  to  postulate  a  large  reservoir  of  uncommitted  cells  to 
allow  for  the  representation  of  newly  learned  motor  patterns.  These  un¬ 
committed  command  cells  would  have  to  maintain  latent  connections  to 
virtually all effector muscles, and the appropriate subsets of these connections
would have to become functional only when the motor skill that requires them
is established, but would then have to be recruited permanently. Finally, there is the problem of
temporal  patterning.  This  problem  needs  to  be  solved  also  for  the  pro¬ 
cessing  of  sensory  patterns  if  these  are  spread  out  in  time,  but  it  is  par¬ 
ticularly  obvious  in  motor  programming.  For  the  execution  of  a  motor 
act it is necessary to generate complex and precisely coordinated temporal
sequences according to which the distributed muscle groups are activated.
One solution would be sets of delay lines to distribute the activity
of the command neuron in the appropriate temporal order to the
effector  neurons  at  more  peripheral  levels.  But  this  would  further  in¬ 
crease  the  number  of  command  neurons  because  each  motion  executed 
at  different  speeds  would  require  a  command  cell  connected  to  a  differ¬ 
ent  set  of  delay  lines.  The  inverse  problem  exists  for  the  representation 
of sensory patterns that have not only a spatial but also a temporal structure.

Because these difficulties cannot be overcome easily in architectures that
solve the binding problem by serial recombination of converging (in the
motor path "diverging") feedforward connections, alternative proposals
have been developed.

Before  discussing  these  alternative  concepts  it  is  necessary  to  emphasize 
that  "binding  by  convergence"  may  be  a  viable  solution  for  specialized 
representational  systems.  However,  because  of  the  limitations  discussed 
above  this  coding  strategy  can  be  used  only  for  the  representation  of  a 
limited  set  of  stereotyped  patterns. 

Alternative  proposals  for  the  solution  to  the  binding  problem  are  based 
on  the  assumption  that  representations  consist  of  assemblies  of  a  large 
number  of  simultaneously  active  neurons  that  may  be  contained  in  a  sin¬ 
gle  cortical  area  but  that  may  also  be  distributed  over  many  cortical  areas 
(Hebb  1949;  Braitenberg  1978;  Edelman  and  Mountcastle  1978;  Crick  1984; 
Grossberg  1980;  Palm  1982,  1990;  Singer  1985,  1990;  von  der  Malsburg 
1985;  Edelman  1987, 1989;  Abeles  1991).  The  essential  feature  of  assembly 
coding  is  that  individual  cells  can  participate  at  different  times  in  the  rep¬ 
resentation  of  different  objects.  The  assumption  is  that  just  as  a  particular 
feature  can  be  present  in  many  different  patterns,  a  neuron  coding  for  this 
feature  can  be  shared  by  many  different  representations.  This  reduces  sub¬ 
stantially  the  number  of  cells  required  for  the  representation  of  different 
objects and allows for considerably more flexibility in the generation of
new  representations. 
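
The saving in cell numbers can be illustrated with a simple count. The sketch below compares a code that dedicates one cell to each object with a code in which any fixed-size subset of a shared pool can serve as an assembly. The pool size and assembly size are hypothetical, and the count is an upper bound, since heavily overlapping assemblies may not be distinguishable in practice.

```python
from math import comb


def dedicated_cells_needed(n_objects):
    """Binding by convergence: one higher order cell per object."""
    return n_objects


def assemblies_available(n_cells, assembly_size):
    """Number of distinct assemblies of a fixed size drawn from n_cells."""
    return comb(n_cells, assembly_size)


# Hypothetical pool of 100 feature-coding cells, 10 active per object.
n_patterns = assemblies_available(100, 10)
print(f"100 shared cells can form {n_patterns:,} assemblies of size 10")
print(f"a dedicated-cell code would need {dedicated_cells_needed(n_patterns):,} cells")
```

The point of the comparison is only the scaling: the number of dedicated cells grows linearly with the number of objects, whereas the number of possible assemblies grows combinatorially with the size of the shared pool.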

Basic  requirements  for  representing  objects  by  such  assemblies  are  as  fol¬ 
lows:  First,  the  responses  of  the  cells  responding  to  a  visual  scene  need  to  be 
compared  with  one  another  and  examined  for  possible,  "meaningful"  rela¬ 
tions.  Second,  cells  coding  for  features  that  can  be  related  need  to  become 
organized  into  an  "assembly."  This  should  be  the  case  for  the  cells  that  are, 
for  example,  activated  by  the  constituent  features  of  a  particular  object. 
Third,  if  patterns  change,  neurons  must  be  able  to  rapidly  change  partners 
and  to  form  new  assemblies.  Fourth,  neurons  that  have  joined  a  particu¬ 
lar  assembly  must  become  identifiable  as  members  of  this  very  assembly. 
Their  responses  must  be  tagged  so  that  they  can  be  recognized  as  being  re¬ 
lated  (i.e.,  the  distributed  responses  of  the  assembly  must  be  recognizable 
as  representing  a  "whole").  It  is  commonly  assumed  that  these  organiz¬ 
ing  steps,  the  probing  of  possible  relations,  the  formation  of  an  assembly, 
and  the  labeling  of  responses  are  achieved  in  a  single  self-organizing  pro¬ 
cess  by  selective  reciprocal  connections  between  the  distributed  neuronal 
elements. The idea is that the probabilities with which neurons become
organized into particular assemblies are determined, first, by the respective
constellation of features in the pattern and, second, by the functional
architecture of the assembly-forming coupling connections. Several proposals
have  been  made  concerning  the  mechanisms  by  which  these  connections 
could  serve  to  "label"  the  responses  of  neurons  that  have  joined  into  the 
same  assembly.  Most  of  them  assume  that  the  assembly-generating  con¬ 
nections  are  excitatory  and  reciprocal  and  serve  to  enhance  and  to  prolong 
the  responses  of  neurons  that  were  organized  in  an  assembly  (Hebb  1949; 
Singer  1979, 1985;  Grossberg  1980;  Palm  1982). 

Another  proposal  is  that  assemblies  should  be  distinguished  in  addition 
by  a  temporal  code  (von  der  Malsburg  1985;  von  der  Malsburg  and  Schnei¬ 
der  1986).  A  similar  suggestion,  although  formulated  less  explicitly,  had 
been  made  previously  by  Milner  (1974).  This  hypothesis  assumes  that  the 
assembly-forming  connections  should  establish  temporal  coherence  on  a 
millisecond  time  scale  between  the  responses  of  the  coupled  cells.  Thus, 
neurons  having  joined  into  an  assembly  would  be  identifiable  as  mem¬ 
bers  of  the  assembly  because  of  the  synchronization  of  their  discharges. 
Expressing  relations  between  members  of  an  assembly  by  the  temporal 
coherence  rather  than  the  amplitude  of  their  responses  has  several  ad¬ 
vantages:  First,  it  reduces  the  ambiguities  that  result  from  the  fact  that 
discharge  rates  depend  strongly  on  variables  such  as  stimulus  intensity 
and  quality  of  fit  between  stimulus  features  and  receptive  field  properties. 
If  assemblies  were  solely  defined  by  a  rate  code  it  would  be  impossible  to 
decide  whether  a  strongly  active  cell  is  discharging  at  a  high  rate  because 
it  joined  an  assembly  or  because  it  was  activated  by  a  particularly  effective 
stimulus.  Relying  on  temporal  relations  preserves  the  important  option 
to  use  discharge  rates  as  a  code  for  stimulus  parameters.  This  is  essential 
in  systems  using  coarse  codes  because  the  information  about  the  presence 
of particular features and about their precise location is contained in the
graded  responses  of  populations  of  cells.  Second,  exploiting  temporal  rela¬ 
tions  increases  the  number  of  assemblies  that  can  be  active  simultaneously 
without  becoming  confounded.  In  most  cases  simultaneously  active  as¬ 
semblies  will  be  distinguished  by  spatial  segregation  due  to  retinotopy 
and  compartmentalization  of  cortical  areas.  But  there  may  be  conditions 
in  which  additional  distinctions  are  required  to  avoid  fusion  of  unrelated 
assemblies.  Responses  of  neurons  could  overlap  on  a  coarse  time  scale  but 
still  remain  distinguishable  as  coming  from  a  particular  assembly  if  they 
are  correlated  at  a  fine  time  scale.  Third,  cells  that  succeeded  in  synchro¬ 
nizing  their  discharges  have  a  stronger  impact  on  target  cells.  This  follows 
from  the  plausible  assumption  that  afferents  to  cortical  neurons  will  be 
more  efficient  in  driving  a  postsynaptic  cell  if  they  discharge  in  synchrony. 
This  effect  will  be  particularly  strong  when  the  activation  levels  of  the  af¬ 
ferent  fibers  are  low  and  when  the  postsynaptic  potentials  evoked  by  the 
individual  fibers  are  small.  Both  conditions  seem  to  be  fulfilled  for  corti¬ 
cal  networks  (see  above).  Thus,  formation  of  coherently  active  assemblies 
can  serve  to  enhance  the  saliency  of  responses  to  features  that  can  be  as¬ 
sociated  in  a  "meaningful"  way.  This  may  contribute  to  the  segregation 
of  object-related  features  from  unrelated  features  of  the  background.  This 
concept  of  "binding  by  synchrony"  has  also  been  applied  to  intermodal 
integration  (Damasio  1990)  and  even  to  high  level  processes  underlying 
phenomena  such  as  attention  (Crick  1984)  and  consciousness  (Crick  and 
Koch  1990a). 
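The third advantage, that synchronously discharging afferents drive a postsynaptic cell more effectively when the individual postsynaptic potentials are small, can be illustrated with a passive leaky integrator. The following is a minimal sketch and not a model taken from the studies cited; the time constant, the potential size, the number of inputs, and the threshold are all assumed values chosen only for illustration.

```python
import math

def peak_depolarization(arrival_times_ms, epsp_mv=0.5, tau_ms=10.0):
    """Peak summed potential of a passive leaky integrator.

    Each input adds an instantaneous step of `epsp_mv` that then decays
    exponentially with time constant `tau_ms` (both values are assumptions).
    """
    peak = 0.0
    for t in sorted(arrival_times_ms):
        # The potential only rises at arrivals, so the peak lies at an arrival time.
        v = sum(epsp_mv * math.exp(-(t - s) / tau_ms)
                for s in arrival_times_ms if s <= t)
        peak = max(peak, v)
    return peak

n_inputs = 40
threshold_mv = 15.0  # hypothetical distance from rest to firing threshold

synchronous = [0.0] * n_inputs                  # all afferent spikes coincide
dispersed = [i * 2.5 for i in range(n_inputs)]  # same spikes spread over 100 ms

sync_peak = peak_depolarization(synchronous)
disp_peak = peak_depolarization(dispersed)
print(sync_peak >= threshold_mv, disp_peak >= threshold_mv)
```

With these assumed numbers the same forty afferent spikes reach the hypothetical threshold only when they arrive together; spread over 100 ms they decay faster than they accumulate.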

PREDICTIONS 

A network that allows for the self-organization of pattern-specific
assemblies must meet the following constraints:

1.  Neurons  within  the  same  cortical  area  as  well  as  neurons  distributed 
across  different  areas  must  be  coupled  reciprocally  by  connections  ensuring 
the  selection  and  dynamic  stabilization  of  specific  assemblies. 

2.  These  connections  must  be  exceedingly  numerous  because  their  num¬ 
ber,  together  with  the  number  of  cells,  limits  the  number  of  possible  con¬ 
stellations. 

3. The assembly-forming connections must be highly specific because the
grouping criteria according to which features are bound together into object
representations reside in the functional architecture of these connections.

4.  The  network  must  allow  for  highly  dynamic  interactions  to  enable  in¬ 
dividual  cells  to  link  at  different  times  with  different  partners. 

5.  The  coupling  connections  must  have  adaptive  synapses  allowing  for 
use-dependent  long-term  modifications  of  synaptic  gain  to  permit  the  for¬ 
mation  and  stabilization  of  new  grouping  criteria  when  new  object  repre¬ 
sentations  are  to  be  installed  during  learning. 

6. These use-dependent synaptic modifications should follow a correlation
rule whereby synaptic connections should strengthen if pre- and postsynaptic
activity is often correlated, and they should weaken in case there is no
correlation.  This  is  required  to  enhance  grouping  of  features  that  often  oc¬ 
cur  in  consistent  relations  as  is  the  case  for  features  constituting  a  particular 
object. 

7.  These  grouping  operations  should  occur  over  multiple  processing  stages 
because the search for "meaningful" groupings has to be performed at different
spatial scales and according to different feature domains. This could be
achieved  by  distributing  the  grouping  operations  over  different  cortical 
areas  in  which  different  neighborhood  relations  are  realized  with  respect  to 
the  representation  of  retinal  location  and  of  feature  domains  by  remapping 
of  inputs. 
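The correlation rule required by constraint 6 can be sketched as a single weight update. The covariance form with a passive decay term used here is one common way to obtain strengthening under correlated activity and weakening otherwise; it is an assumption made for illustration, and the function name, learning rate, and decay constant are hypothetical rather than taken from the text.

```python
def update_weight(w, pre, post, rate=0.1, decay=0.02, w_max=1.0):
    """Apply one hypothetical correlation-rule update over a window of activity.

    pre, post: equal-length sequences of activity samples (e.g. 0/1 spike flags).
    The weight grows with the covariance of pre- and postsynaptic activity
    and otherwise decays slowly toward zero.
    """
    n = len(pre)
    mean_pre = sum(pre) / n
    mean_post = sum(post) / n
    cov = sum((a - mean_pre) * (b - mean_post) for a, b in zip(pre, post)) / n
    w = w + rate * cov - decay * w
    return min(max(w, 0.0), w_max)  # keep the weight within assumed bounds

pre = [1, 0, 1, 0, 1, 0, 1, 0]
post_correlated = [1, 0, 1, 0, 1, 0, 1, 0]    # fires together with pre
post_uncorrelated = [1, 1, 0, 0, 1, 1, 0, 0]  # zero covariance with pre

w_corr = update_weight(0.5, pre, post_correlated)
w_uncorr = update_weight(0.5, pre, post_uncorrelated)
print(w_corr, w_uncorr)
```

In this toy case the correlated pair ends above the starting weight of 0.5 and the uncorrelated pair ends below it, which is the qualitative behavior the constraint asks for.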

These  seven  predictions  need  to  be  fulfilled  irrespective  of  whether  as¬ 
semblies  are  defined  by  rate  or  temporal  codes.  If  cells  having  joined  an 
assembly  are  distinguished  by  a  rate  code  the  prediction  is  that  cells  ac¬ 
tivated  by  features  of  a  particular  object  engage  in  stronger  and  perhaps 
also  more  sustained  responses  than  cells  responding  to  features  resisting 
grouping.  However,  no  differences  should  be  found  between  the  enhanced 
responses  of  cells  participating  in  different  assemblies  representing  differ¬ 
ent  objects.  They  should  all  be  enhanced  to  a  similar  extent.  If  assemblies 
are  distinguished  in  addition  or  alternatively  by  the  temporal  coherence  of 
the  responses  of  the  constituting  neurons  a  further  set  of  predictions  can 
be  derived. 

1.  Spatially  segregated  neurons  should  synchronize  their  responses  if  ac¬ 
tivated  by  features  that  can  be  grouped  together.  This  should  be  the  case 
for  features  constituting  a  single  object. 

2.  Synchronization  should  be  frequent  among  neurons  within  a  particular 
cortical  area  but  it  should  also  occur  across  cortical  areas. 

3.  The  probability  that  neurons  synchronize  their  responses  both  within  a 
particular  area  and  across  areas  should  reflect  some  of  the  Gestalt  criteria 
used for perceptual grouping (Koffka 1935; Kohler 1969).

4.  Individual  cells  must  be  able  to  rapidly  change  the  partners  with  which 
they  synchronize  their  responses  if  stimulus  configurations  change  and 
require  new  associations. 

5.  If  more  than  one  object  is  present  in  a  scene  several  assemblies  should 
form.  Cells  belonging  to  the  same  assembly  should  synchronize  their  re¬ 
sponses  while  no  consistent  temporal  relations  should  exist  between  the 
discharges  of  neurons  belonging  to  different  assemblies. 

6.  Synchronization  probability  should  at  least  in  part  depend  on  the  func¬ 
tional  architecture  of  reciprocal  corticocortical  connections  and  should 
change  if  this  architecture  is  modified. 

EXPERIMENTAL EVIDENCE FOR TEMPORAL RELATIONS


To  test  these  predictions  it  is  necessary  to  record  simultaneously  from  spa¬ 
tially  distributed  neurons  in  the  brain  and  to  search  for  systematic  tem¬ 
poral  correlations  among  their  responses  (Gerstein  et  al.  1985).  It  is  not 
sufficient  to  analyze  synchronization  probability  of  spontaneous  activity 
as  this  would  reveal  only  the  architecture  and  coupling  strength  of  con¬ 
nections  and  not  the  dynamic  properties  of  the  network  that  emerge  only 
on  stimulation.  Thus,  the  correlation  studies  that  have  been  performed 
with  the  goal  of  revealing  anatomical  connection  patterns  are  relevant  in 
the present context inasmuch as they provide data on the organization of
coupling  connections  but  they  usually  do  not  address  the  more  dynamic, 
stimulus-dependent  interactions  that  are  predicted  from  the  assembly  hy¬ 
pothesis. 

In  the  visual  cortex  correlations  between  the  activities  of  simultaneously 
recorded  cortical  cells  were  found  to  be  frequent,  especially  when  they 
were  closely  spaced  and  located  within  single  functional  columns.  The 
observed  correlation  patterns  were  indicative  of  constellations  where  cells 
receive either common excitatory or inhibitory input or where one cell
excites or inhibits the other (Toyama et al. 1981a,b; Michalski et al. 1983;
Abeles and Gerstein 1988; Hata et al. 1988; Gochin et al. 1991). For cells
located in different functional columns and hence being separated by several
hundred  micrometers  along  trajectories  parallel  to  the  pial  surface  corre¬ 
lations  were  more  difficult  to  detect.  This  agrees  with  anatomical  data 
that  indicate  that  connections  are  densest  between  cells  staggered  within 
narrow  cylinders  orthogonal  to  the  lamination  and  rapidly  decrease  along 
trajectories  tangential  to  the  lamination  (for  review  see  Douglas  and  Mar¬ 
tin  1993).  When  interactions  were  found  over  larger  tangential  distances 
the  cross-correlograms  usually  had  a  peak  centered  around  zero  delay  that 
was  interpreted  as  indicative  of  common  excitatory  input  (Tso  et  al.  1986; 
Kruger  and  Aiple  1988;  Gochin  et  al.  1991)  or  of  common  modulation  of 
excitability  (Aiple  and  Kruger  1988;  Kruger  and  Aiple  1988).  However,  as 
detailed  below,  there  may  be  interpretations  other  than  common  input  for 
this  type  of  interaction. 

Data  from  the  visual  cortex  of  cats  and  monkeys  suggested  in  addition 
that  long-range  interactions  are  confined  to  neurons  that  share  similar  pref¬ 
erences  for  the  orientation  and/or  the  spectral  composition  of  stimuli  (Tso 
et  al.  1986;  Tso  and  Gilbert  1988;  Schwarz  and  Bolz  1991).  This  agrees  with 
anatomical data that show that tangential intracortical connections are
selective and link preferentially evenly spaced patches of cortical tissue that
are  closely  related  to  functional  columns  (Rockland  and  Lund  1982;  Gilbert 
and  Wiesel  1989;  but  see  Matsubara  et  al.  1985). 

Systematic  search  for  more  dynamic  stimulus-dependent  interactions 
between  spatially  distributed  cortical  neurons  had  been  initiated  by  the 
observation  that  adjacent  neurons  in  the  cat  visual  cortex  can  transiently 
engage in highly synchronous discharges when presented with their
preferred stimulus (Gray and Singer 1987). Groups of neurons recorded
simultaneously  with  a  single  electrode  were  found  to  discharge  in  syn¬ 
chronous  "bursts"  that  followed  one  another  at  intervals  of  15-30  msec. 
Typically,  individual  neurons  would  contribute  only  one  or  two  spikes  to 
such  synchronous  events  but  in  the  multicell  recordings  these  episodes 
of  synchronous  discharge  appeared  as  bursts.  These  sequences  of  syn¬ 
chronous  rhythmic  firing  occur  preferentially  when  cells  are  activated  with 
slowly  moving  light  stimuli.  They  last  no  more  than  a  few  hundred  mil¬ 
liseconds  and  may  occur  several  times  during  a  single  passage  of  the  mov¬ 
ing  stimulus  (figure  10.1).  Accordingly,  autocorrelograms  computed  from 
such  response  epochs  often  exhibit  a  periodic  modulation  (Gray  and  Singer 
1987, 1989; Eckhorn et al. 1988; Gray et al. 1990; Schwarz and Bolz 1991;
Livingstone  1991).  During  such  episodes  of  synchronous  firing  a  large 
oscillatory  field  potential  is  recorded  by  the  same  electrode,  the  negative 
deflections  being  coincident  with  the  cells'  discharges.  The  occurrence  of 
such  a  large  field  response  indicates  that  many  more  cells  in  the  vicinity 
of  the  electrode  than  those  actually  picked  up  by  the  electrode  must  have 
synchronized  their  discharges  (Gray  and  Singer  1989). 

However,  neither  the  time  of  occurrence  of  these  synchronized  response 
episodes  nor  the  phase  of  the  oscillations  is  related  to  the  position  of  the 
stimulus  within  the  neuron's  receptive  field.  When  cross-correlation  func¬ 
tions  are  computed  between  responses  to  subsequently  presented  identical 
stimuli,  these  "shift  predictors"  reveal  no  relation  between  the  temporal 
patterning  of  successive  responses  (Gray  and  Singer  1989;  Gray  et  al.  1990). 
The  rhythmic  firing  is  thus  not  related  to  some  fine  spatial  structure  in  the 
receptive  fields  of  cortical  neurons. 
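The shift-predictor control described above can be sketched as follows. This is a toy illustration under stated assumptions, not the analysis code of the cited studies: spike times are integer milliseconds, the function names are hypothetical, and the two simulated units fire in perfectly coincident 40-Hz bursts whose timing differs from trial to trial.

```python
from collections import Counter

def cross_correlogram(spikes_a, spikes_b, max_lag_ms=50):
    """Histogram of lags (t_b - t_a) in 1-ms bins; spike times are integer ms."""
    counts = Counter()
    for ta in spikes_a:
        for tb in spikes_b:
            lag = tb - ta
            if abs(lag) <= max_lag_ms:
                counts[lag] += 1
    return counts

def summed_correlogram(trials_a, trials_b, shift=0):
    """Sum correlograms over trials, pairing trial i of A with trial i+shift of B.

    shift=0 gives the raw correlogram; shift=1 gives the shift predictor,
    which keeps stimulus-locked structure but removes within-trial synchrony.
    """
    n = len(trials_a)
    total = Counter()
    for i in range(n):
        total.update(cross_correlogram(trials_a[i], trials_b[(i + shift) % n]))
    return total

# Assumed toy data: burst onset differs on every trial (not stimulus-locked).
offsets = [3, 11, 17, 8]
trials_a = [[off + 25 * k for k in range(4)] for off in offsets]
trials_b = [list(trial) for trial in trials_a]  # unit B fires with unit A

raw = summed_correlogram(trials_a, trials_b, shift=0)
predictor = summed_correlogram(trials_a, trials_b, shift=1)
print(raw[0], predictor[0])
```

Because the bursts are synchronized within a trial but not locked to the stimulus, the raw correlogram has a peak at zero lag while the shift predictor has none there. With only four toy trials the predictor still shows peaks at other lags; over many trials with variable timing it would flatten out.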

This  phenomenon  of  local  response  synchronization  has  been  observed 
with multiunit and field potential recordings in several independent
studies in different areas of the visual cortex of anesthetized cats (areas 17, 18,
19, and PMLS) (Eckhorn et al. 1988, 1992; Gray and Singer 1989; Gray et al.
1990;  Engel  et  al.  1991a),  in  area  17  of  awake  cats  (Raether  et  al.  1989;  Gray 
and Viana di Prisco 1993), in the optic tectum of awake pigeons
(Neuenschwander and Varela 1990), and in the visual cortex of anesthetized
(Livingstone 1991) and awake behaving monkeys (Kreiter and Singer 1992).

Subsequently,  it  has  been  shown  with  multielectrode  recordings  in  anes¬ 
thetized  and  awake  cats  (Gray  et  al.  1989;  Raether  et  al.  1989;  Engel  et  al. 
1990; Kreiter and Singer 1992; Kreiter et al. 1992) and anesthetized and
awake  monkeys  (Kreiter  and  Singer  1992;  Kreiter  et  al.  1992)  that  similar 
response  synchronization  can  occur  also  between  spatially  segregated  cell 
groups  within  the  same  visual  area.  When  cells  engage  in  such  long  dis¬ 
tance  synchronization  the  firing  patterns  of  the  local  groups  often  exhibit 
the  synchronous  repetitive  firing  described  above.  Interestingly,  the  syn¬ 
chronization  of  responses  over  larger  distances  also  occurs  with  zero  phase 
lag.  Hence,  if  the  cross-correlograms  show  any  interaction  at  all,  they  typi¬ 
cally  have  a  peak  centered  around  zero  delay.  The  half-width  at  half-height 
of this peak is on the order of 2-3 msec, indicating that most of the action
potentials that showed some consistent temporal relation had occurred nearly
simultaneously. This peak is often flanked on either side by troughs that
result from pauses between the synchronous bursts. When the duration
of these pauses is sufficiently constant throughout the episode of
synchronization, the cross-correlograms show in addition a periodic modulation
with further side peaks and troughs. But such regularity is not a necessary
requirement for synchronization to occur. There are numerous examples
from anesthetized cats (see, e.g., Engel et al. 1991c; Nelson et al. 1992c) and
especially from awake monkeys (Kreiter and Singer 1992) showing that responses
of spatially distributed neurons can become synchronized and lead to
cross-correlograms with significant center peaks without engaging in rhythmic
activity that is sufficiently regular to produce a periodic modulation of
averaged auto- and cross-correlograms. However, there are relations
between oscillatory discharge patterns and response synchronization that
will be discussed in detail in a later section.

Figure 10.1 Multiunit activity (MUA) and local field potential (LFP) responses recorded
from area 17 in an anesthetized adult cat to the presentation of an optimally oriented light
bar moving across the receptive field. Oscilloscope records of a single trial showing the
response to the preferred direction of movement. In the upper two traces, at a slow time
scale, the onset of the neuronal response is associated with an increase in high-frequency
activity in the LFP. The lower two traces display the activity at the peak of the response at
an expanded time scale. Note the presence of rhythmic oscillations in the LFP and MUA
(35-45 Hz) that are correlated in phase with the peak negativity of the LFP. Upper and lower
voltage scales are for the LFP and MUA, respectively. (Adapted from Gray and Singer 1989)
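The half-width at half-height quoted above can be read off a cross-correlogram as sketched below. This is only one plausible way to take the measurement: the baseline (chance level) is passed in as an assumed value, the function name is hypothetical, and the bin counts are invented to resemble a center peak flanked by troughs.

```python
def half_width_at_half_height(lags_ms, counts, baseline):
    """Return (peak lag, half-width) of the tallest correlogram peak.

    Half height is measured from `baseline` (an assumed chance level) to the
    peak. The half-width is the mean distance, over both sides, from the peak
    to the outermost contiguous bin that is still at or above half height.
    """
    i_peak = max(range(len(counts)), key=lambda i: counts[i])
    half = baseline + (counts[i_peak] - baseline) / 2.0

    def distance(step):
        i = i_peak
        while 0 <= i + step < len(counts) and counts[i + step] >= half:
            i += step
        return abs(lags_ms[i] - lags_ms[i_peak])

    return lags_ms[i_peak], (distance(-1) + distance(+1)) / 2.0

# Hypothetical correlogram in 1-ms bins: center peak with flanking troughs.
lags = list(range(-8, 9))
counts = [10, 8, 6, 9, 16, 27, 34, 38, 40, 37, 33, 26, 15, 9, 6, 8, 10]

peak_lag, width = half_width_at_half_height(lags, counts, baseline=10)
print(peak_lag, width)
```

For these invented counts the peak sits at zero lag with a half-width of 3 msec, comparable to the 2-3 msec range reported in the text.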

The  Dependence  of  Response  Synchronization  on  Stimulus 
Configuration 

As  outlined  above  the  hypothesis  of  temporally  coded  assemblies  requires 
that  the  probabilities  with  which  distributed  cells  synchronize  their  re¬ 
sponses  should  reflect  some  of  the  Gestalt  criteria  applied  in  perceptual 
grouping.  Another  and  related  prediction  is  that  individual  cells  must 
be  able  to  change  the  partners  with  which  they  synchronize  whereby  the 
selection  of  partners  should  occur  as  a  function  of  the  patterns  used  to 
activate  the  cells.  In  this  section  experiments  are  reviewed  that  were  de¬ 
signed  to  address  these  predictions.  Detailed  studies  in  anesthetized  cats 
and  recently  also  anesthetized  and  awake  monkeys  have  revealed  that 
synchronization probability for remote groups of cells is determined both
by factors within the brain and by the configuration of the stimuli
(Gray  et  al.  1989;  Engel  et  al.  1990, 1991a,b,c;  Kreiter  et  al.  1992;  Konig  et  al. 
1993).  In  general,  synchronization  probability  within  a  particular  cortical 
area  decreases  with  increasing  distance  between  the  cells.  If  cells  are  so 
closely  spaced  that  their  receptive  fields  overlap,  the  probability  is  high 
that  their  responses  will  exhibit  synchronous  epochs  if  evoked  with  a  sin¬ 
gle  stimulus.  This  requires  that  the  orientation  and  direction  preferences 
of  the  cell  pairs  are  sufficiently  similar  or  that  their  tuning  is  sufficiently 
broad  to  allow  for  coactivation  by  a  single  stimulus.  As  recording  distance 
increases  synchronization  probability  becomes  more  and  more  dependent 
on  the  similarity  between  the  orientation  preferences  of  the  neurons  (Tso 
et  al.  1986;  Engel  et  al.  1990). 

Concerning  the  dependence  of  synchronization  probability  on  stimulus 
configuration  single  linearly  moving  contours  have  so  far  been  found  to 
be  most  efficient.  Gray  et  al.  (1989)  recorded  multiunit  activity  from  two 
locations  in  cat  area  17  separated  by  7  mm.  The  receptive  fields  of  the 
cells  were  nonoverlapping,  had  nearly  identical  orientation  preferences, 
and  were  spatially  displaced  along  the  axis  of  preferred  orientation.  This 
enabled  stimulation  of  the  cells  with  bars  of  the  same  orientation  under 
three  different  conditions:  two  bars  moving  in  opposite  directions,  two 
bars  moving  in  the  same  direction,  and  one  long  bar  moving  across  both 
fields  coherently.  No  significant  correlation  was  found  when  the  cells  were 
stimulated  by  oppositely  moving  bars.  A  weak  correlation  was  present  for 
the  coherently  moving  bars.  But  the  long  bar  stimulus  resulted  in  a  robust 
synchronization  of  the  activity  at  the  two  sites.  This  effect  occurred  in 
spite  of  the  fact  that  the  overall  number  of  spikes  produced  by  the  two 
cells  and  the  oscillatory  patterning  of  the  responses  were  similar  in  the 
three  conditions. 

In a related experiment Engel et al. (1991a) demonstrated that the
synchronization of activity between cells in areas 17 and PMLS of the cat also
depends on the properties of the visual stimulus. They recorded from cells
in the two areas that had nonoverlapping receptive fields with similar
orientation preference that were aligned collinearly. This made it possible to
examine the effects of coherent motion on response synchronization
between cells located in different areas. Little or no correlation was found
when the cells were activated by oppositely moving contours but a robust
synchronization occurred when the cells were coactivated by a single long
bar moving over both fields (figure 10.2; Engel et al. 1991a). These
findings, combined with the earlier results, indicate that the global properties
of  visual  stimuli  can  influence  the  magnitude  of  synchronization  between 
widely  separated  cells  located  within  and  between  different  cortical  areas. 
Single  contours  but  also  spatially  separate  contours  that  move  coherently 
and  therefore  appear  as  parts  of  a  single  figure  are  more  efficient  in  in¬ 
ducing  synchrony  among  the  responding  cell  groups  than  incoherently 
moving  contours  that  appear  as  parts  of  independent  figures. 

These  results  indicate  clearly  that  synchronization  probability  depends 
not  only  on  the  spatial  segregation  of  cells  and  on  their  feature  preferences, 
the  latter  being  related  to  the  cells'  position  within  the  columnar  architec¬ 
ture  of  the  cortex,  but  also  and  to  a  crucial  extent  on  the  configuration  of  the 
stimuli.  So  far,  synchronization  probability  appears  to  reflect  rather  well 
some  of  the  Gestalt  criteria  for  perceptual  grouping.  The  high  synchro¬ 
nization  probability  of  nearby  cells  corresponds  to  the  binding  criterion  of 
"vicinity,"  the  dependence  on  receptive  field  similarities  agrees  with  the 
criterion  of  "similarity,"  the  strong  synchronization  observed  in  response 
to  continuous  stimuli  obeys  the  criterion  of  "continuity,"  and  the  lack  of 
synchrony  in  responses  to  stimuli  moving  in  opposite  directions  relates  to 
the  criterion  of  "common  fate." 

Experiments  have  also  been  performed  to  test  the  prediction  that  simul¬ 
taneously  presented  but  different  contours  should  lead  to  the  organization 
of  two  independently  synchronized  assemblies  of  cells  (Engel  et  al.  1991c; 
Kreiter  et  al.  1992).  If  groups  of  cells  with  overlapping  receptive  fields  but 
different  orientation  preferences  are  activated  with  a  single  moving  light 
bar they synchronize their responses even if some of these groups are
suboptimally activated (Engel et al. 1990, 1991c). However, if such a set of
groups  is  stimulated  with  two  independent  spatially  overlapping  stimuli 
that  move  in  different  directions,  they  split  into  two  independently  syn¬ 
chronized  assemblies,  those  groups  joining  the  same  synchronously  active 
assembly  that  have  a  preference  for  the  same  stimulus  (figure  10.3).  Thus, 
the  two  stimuli  evoke  simultaneous  responses  in  a  large  array  of  spatially 
interleaved  neurons  but  these  neurons  become  organized  in  two  assemblies 
that  can  be  distinguished  because  of  the  temporal  coherence  of  responses 
within  and  the  lack  of  coherence  between  assemblies.  Cells  representing 
the same stimulus exhibit synchronized response epochs while no
consistent correlations occur between the responses of cells that are evoked
by different stimuli. The parameters of the individual responses such as
their amplitude or oscillatory patterning were not affected by changes in
the global configuration of the stimuli. Thus, it is not possible to tell from
the responses of individual cells whether they were activated by a single
contour or by two different stimuli. Even if one evaluated the extent of
coactivation of the simultaneously recorded cells on a coarse time scale
of several hundred msec as would be sufficient for the analysis of rate-coded
populations, one would not be able to decide whether the cells had been
activated by one composite figure whose features satisfy the preferences
of the active cells or by two independent figures that excite the same set of
cells. The only cue for this distinction was provided by the evaluation of
synchronicity at a millisecond time scale.

Figure 10.2 Multielectrode recordings from an anesthetized cat showing that interareal
synchronization is sensitive to global stimulus features. (A) Position of the recording
electrodes. A17, area 17; LAT, lateral sulcus; SUPS, suprasylvian sulcus; PMLS, posterior
mediolateral suprasylvian sulcus; P, posterior; L, lateral. (B1-B3) Plots of the receptive
fields of the PMLS and area 17 recording. The diagrams depict the three stimulus conditions
tested. The circle indicates the visual field center. (C1-C3) Peristimulus time histograms for
the three stimulus conditions. The vertical lines indicate 1-sec windows for which
autocorrelograms and cross-correlograms were computed. (D1-D3) Comparison of the
autocorrelograms computed for the three stimulus paradigms. Note that the modulation
amplitude of the correlograms is similar in all three cases (indicated by the number in the
upper right corner). (E1-E3) Cross-correlograms computed for the three stimulus conditions.
The number in the upper right corner represents the relative modulation amplitude of each
correlogram. Note that the strongest correlogram modulation is obtained with the continuous
stimulus. The cross-correlogram is less regular and has a lower modulation amplitude when
two light bars are used as stimuli, and there is no significant modulation (n.s.) with two
light bars moving in opposite directions. (From Engel et al. 1991a)
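The way pairwise synchrony partitions simultaneously active cell groups into assemblies can be sketched as a grouping step over the set of significantly correlated pairs. The sketch treats synchrony as transitive and takes the list of significant pairs as given; both are simplifying assumptions, and the pair lists below are hypothetical inputs that mimic a one-stimulus and a two-stimulus condition for four recorded groups.

```python
def assemblies_from_synchrony(n_groups, synchronized_pairs):
    """Partition recorded groups into assemblies.

    Groups are nodes, significant synchrony is an edge, and each assembly is
    one connected component (computed here with a small union-find).
    """
    parent = list(range(n_groups))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in synchronized_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for g in range(n_groups):
        clusters.setdefault(find(g), []).append(g)
    return sorted(clusters.values())

# Hypothetical outcomes for four groups (indices 0-3).
one_stimulus = [(0, 1), (1, 2), (2, 3)]  # all groups synchronize
two_stimuli = [(0, 2), (1, 3)]           # synchrony only within each pair

single = assemblies_from_synchrony(4, one_stimulus)
split = assemblies_from_synchrony(4, two_stimuli)
print(single, split)
```

Note that firing rates never enter the function: the same four active groups yield one assembly or two depending only on which pairs are synchronized, which is the point made in the paragraph above.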

The  results  of  these  experiments  also  prove  that  individual  groups  can 
change  the  partners  with  which  they  synchronize  when  stimulus  con¬ 
figurations  change.  Cell  groups  that  engaged  in  synchronous  response 
episodes  when  activated  with  a  single  stimulus  no  longer  did  so  when 
activated  with  two  stimuli  but  then  synchronized  with  other  groups.  One 
methodological  caveat  following  from  this  is  that  cross-correlation  analysis 
does  not  always  reliably  reflect  anatomical  connectivity  (see  also  Aertsen 
and  Gerstein  1985).  In  agreement  with  the  predictions  from  the  assembly 
hypothesis  interactions  between  distributed  cell  groups  were  found  to  be 
highly  dynamic,  variable,  and  strongly  influenced  by  the  constellation  of 
features  in  the  visual  stimulus. 

Synchronization  between  Areas 

Experiments  have  also  been  designed  to  test  the  prediction  that  cells  dis¬ 
tributed  across  different  cortical  areas  should  be  able  to  synchronize  their 
responses if they respond to the same contour. This prediction applies
not only to interactions between cells distributed within different visual
areas in the same hemisphere but also to cells in different hemispheres.
The  reason  is  that  because  of  the  partial  decussation  of  the  optic  nerves, 
neurons  responding  to  a  figure  extending  across  the  vertical  meridian  are 
distributed  across  both  hemispheres.  As  the  responses  of  these  cells  have 
to  be  related  to  one  another  in  the  same  way  as  those  of  cells  located  within 
the  same  hemisphere  response  synchronization  should  occur  also  across 
hemispheres  and  depend  on  stimulus  configurations  in  the  same  way  as 
intrahemispheric  synchronization. 

In  the  cat,  interareal  synchronization  of  unit  responses  has  been  observed 
between cells in areas 17 and 18 (Eckhorn et al. 1988, 1992; Nelson et al.
1992a), between cells in areas 17 and 19 and areas 18 and 19 (Eckhorn et al.
1992), between cells in area 17 and area PMLS, an area specialized for
motion processing (figure 10.2) (Engel et al. 1991a; Munk et al. 1992), and even
between neurons in area 17 of the two hemispheres (Engel et al. 1991b;
Eckhorn et al. 1992; Nelson et al. 1992b). In the macaque monkey synchronous
firing has been observed between neurons in areas V1 and V2 (Bullier et

Figure 10.3 Stimulus dependence of short-range interactions. Multiunit activity was
recorded from four different orientation columns of area 17 of cat visual cortex separated
by 0.4 mm. The four cell groups had overlapping receptive fields and orientation preferences
of 22° (group 1), 112° (group 2), 157° (group 3), and 90° (group 4), as indicated by the thick
line drawn across each receptive field in (A-D). The figure shows a comparison of responses
to stimulation with single moving light bars of varying orientation (left) with responses to
the combined presentation of two superimposed light bars (right). For each stimulus
condition, the shading of the receptive fields indicates the responding cell groups. Stimulation
with a single light bar yielded a synchronization between all cells activated by the respective
orientation. Thus, groups 1 and 3 responded synchronously to a vertically oriented (0°)
light bar (A), groups 2 and 4 to a light bar at an orientation of 112° (B), and cell groups 2
and 3 to a light bar of intermediate orientation (C). Simultaneous presentation of two stimuli
with orientations of 0° and 112°, respectively, activated all four groups (D). However, in
this case the groups segregated into two distinct assemblies, depending on which stimulus
was closer to the preferred orientation of each group. Thus, responses were synchronized
between groups 1 and 3, which preferred the vertical stimulus, and between 2 and 4, which
preferred the stimulus oriented at 112°. The two assemblies were
desynchronized with respect to each other, and so there was no significant synchronization
between groups 2 and 3. The cross-correlograms between groups 1 and 2, 1 and 4, and 3 and
4 were also flat (not shown). Note that the segregation cannot be explained by preferential
anatomical wiring of cells with similar orientation preference (Ts'o et al. 1986) because cell
groups can readily be synchronized in all possible pair combinations in response to a single
light bar. The correlograms are shown superimposed with a numerically fitted Gabor
function. The number to the upper right of each correlogram indicates the relative modulation
amplitude. n.s., not significant. Scale bars indicate the number of spikes. (From Engel et al. 1991c)


al. 1992; Roe and Ts'o 1992). In all of these cases, whenever tested,
synchronization depended on receptive field constellations and stimulus
configurations, similar to the intraareal synchronization (Engel et al. 1991a,b). In
the studies of Eckhorn et al. (1988, 1992) and Engel et al. (1991a), interareal
and interhemispheric synchronous firing was found to occur primarily, if
not exclusively, during coactivation of the cells by visual stimuli, and was
particularly pronounced during periods of oscillatory firing (Konig 1994).
In  the  studies  of  Nelson  et  al.  (1992a)  interareal  synchronous  firing  was 
observed  both  during  spontaneous  activity  and  during  the  presentation 
of  visual  stimuli.  The  interactions  span  a  wide  "tripartite"  range  of  tem¬ 
poral  scales  giving  rise  to  correlograms  having  central  peaks  of  narrow, 
medium,  and  broad  width.  The  narrow  (tight)  coupling  is  most  often  seen 
between  cells  having  overlapping  receptive  fields  with  similar  properties. 
The  broader  coupling  encompasses  a  much  wider  range  of  receptive  field 
separations  and  differences  in  orientation  preference  (Nelson  et  al.  1992a). 
Synchronous firing between cells in V1 and V2 in the monkey shows
similar features. Synchrony occurs between cells having both overlapping and
nonoverlapping receptive fields (Bullier et al. 1992) and is most frequent
between cells of similar color selectivity in the two areas (Roe and Ts'o
1992). Interestingly, whenever cells in V1 and V2 engage in synchronous
discharges, on average cells in V2 lead cells in V1 by a few milliseconds
(Bullier, personal communication). Synchronization of responses can
thus  occur  over  considerable  distances  and  between  cell  groups  located  in 
different  cortical  areas  and  even  hemispheres. 

THE  SYNCHRONIZING  CONNECTIONS 

It  is  commonly  assumed  in  interpretations  of  cross-correlation  data  that 
synchronization  of  neuronal  responses  with  zero-phase  lag  is  indicative  of 
common  input  (Gerstein  and  Perkel  1972).  Because  response  synchroniza¬ 
tion  occurred  often  in  association  with  oscillatory  activity  in  the  range  of 
30-60 Hz, it has been proposed that the observed synchronization
phenomena in the visual cortex are due to common oscillatory input from
subcortical  centers  (see  chapter  6  by  Llinas  and  Ribary,  and  Llinas  and  Rib¬ 
ary  1993).  Oscillatory  activity  in  the  30-60  Hz  range  has  been  described 
both  for  retinal  ganglion  cells  and  thalamic  neurons  (Doty  and  Kimura, 
1963; Bishop et al. 1964; Fuster et al. 1965; Arnett, 1975; Ariel et al. 1983;
Munemori  et  al.  1984;  Ghose  and  Freeman  1990;  Steriade  et  al.  1991;  Ghose 
and  Freeman  1992;  Pinault  and  Deschenes,  1992a,b;  Steriade  et  al.  1993). 
In  both  structures  oscillatory  activities  have  been  observed  in  about  20%  of 
the  cells.  They  occurred  during  spontaneous  activity  and  were  often  un¬ 
influenced  by  visual  stimulation  or  even  suppressed  (Ghose  and  Freeman 
1992).  These  oscillatory  patterns  in  afferent  activity  are  likely  to  contribute 
to  oscillatory  responses  in  the  visual  cortex,  but  the  possibility  must  also  be 
considered  that  part  of  the  thalamic  oscillations  are  backpropagated  from 
cortex  by  the  corticothalamic  projections. 

Another  question  is  whether  these  thalamic  oscillations  also  play  a  role  in 
stimulus-dependent  synchronization  of  spatially  distributed  cortical  neu¬ 
rons.  Because  the  terminal  arbors  of  thalamic  axons  span  only  3-4  mm  in 
the  cortex  (Ferster  and  LeVay  1978),  the  long  distance  correlations  within 
areas  but  especially  between  areas  and  different  hemispheres  would  re¬ 
quire  that  thalamic  activity  becomes  synchronized  not  only  across  different 
nuclei  but  even  across  the  thalami  of  the  two  hemispheres  to  contribute 
effectively  to  long  distance  correlations  at  the  cortical  level.  Large-scale 
synchronization  of  distributed  thalamic  neurons  is  common  during  sleep 
spindles  (Steriade  et  al.  1990)  but  so  far  correlated  30-60  Hz  oscillatory  ac¬ 
tivity  has  been  observed  only  between  closely  spaced  cells  (Arnett  1975). 

If  the  synchronization  phenomena  observed  at  the  cortical  level  were 
solely  a  reflection  of  common  subcortical  input,  this  would  be  incompati¬ 
ble  with  the  postulated  role  of  synchronization  in  perceptual  grouping.  The 
hypothesis  requires  that  synchronization  probability  depends  to  a  substan¬ 
tial  extent  on  interactions  between  the  neurons  whose  responses  actually 
represent  the  features  that  need  to  be  bound  together.  As  thalamic  cells 
possess  only  very  limited  feature  selectivity  one  is  led  to  postulate  that 
corticocortical  connections  should  also  contribute  to  the  synchronization 
process.  This  postulate  is  supported  by  the  finding  that  synchronization 
between  cells  located  in  different  hemispheres  is  abolished  when  the  cor¬ 
pus  callosum  is  cut  (Engel  et  al.  1991b;  Munk  et  al.  1992).  This  is  direct 
proof  (1)  that  corticocortical  connections  contribute  to  response  synchro¬ 
nization  and  (2)  that  synchronization  with  zero-phase  lag  can  be  brought 
about  by  reciprocal  interactions  between  spatially  distributed  neurons  de¬ 
spite  considerable  conduction  delays  in  the  coupling  connections.  Thus, 
synchrony  is  not  necessarily  an  indication  of  common  input  but  may  also 
be  the  result  of  a  dynamic  organization  process  that  establishes  coherent 
firing  by  reciprocal  interactions. 

Simulation  studies  are  now  available  that  confirm  that  synchrony  can  be 
established  without  phase  lag  by  reciprocal  connections  even  if  they  have 
slow  and  variable  conduction  velocities  as  long  as  the  propagation  delays 
do  not  exceed  about  one  quarter  of  the  oscillatory  period  (Schillen  and 
Konig  1990;  Schuster  and  Wagner  1990a,b;  Konig  and  Schillen  1991).  Use- 
dependent  developmental  selection  of  corticocortical  connections  could 
further  contribute  to  the  generation  of  architectures  that  favor  synchrony. 
During early postnatal development corticocortical connections are
susceptible to use-dependent modifications and are selected according to a
correlation  rule  (Lowel  and  Singer  1992;  see  below).  This  favors  consolida¬ 
tion  of  connections  whose  activity  is  often  in  synchrony  with  the  activity  of 
their  respective  target  cells.  Hence,  it  is  to  be  expected  that  connections  are 
selected  not  only  according  to  their  feature-specific  responses  but  also  as  a 
function  of  conduction  velocities  that  allow  for  a  maximum  of  synchrony. 
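
The delay condition reported in the simulation studies cited above can be
illustrated with a toy system that is much simpler than the networks used in
those studies. The following Python sketch is an illustration under stated
assumptions, not a reconstruction of any cited model: it couples two identical
phase oscillators (a Kuramoto-type description, chosen here only for brevity)
through reciprocal connections with a fixed conduction delay and integrates
them with the Euler method. The function name, the 40 Hz frequency, the
coupling strength, and the two delays are hypothetical choices.

```python
import math


def final_phase_lag(delay_s, freq_hz=40.0, coupling=50.0,
                    t_end=2.0, dt=1e-4, initial_lag=1.0):
    """Absolute phase lag (rad, 0..pi) between two reciprocally coupled
    phase oscillators after t_end seconds. All parameters are illustrative."""
    omega = 2.0 * math.pi * freq_hz          # intrinsic angular frequency
    d = max(1, round(delay_s / dt))          # conduction delay in time steps

    # History before t = 0: both units run freely, offset by initial_lag.
    th1 = [omega * (k - d) * dt for k in range(d + 1)]
    th2 = [x + initial_lag for x in th1]

    for _ in range(round(t_end / dt)):
        now1, now2 = th1[-1], th2[-1]
        old1, old2 = th1[-1 - d], th2[-1 - d]   # phases one delay ago
        # Each unit is pulled toward the delayed phase of its partner.
        th1.append(now1 + dt * (omega + coupling * math.sin(old2 - now1)))
        th2.append(now2 + dt * (omega + coupling * math.sin(old1 - now2)))

    diff = th2[-1] - th1[-1]
    # Wrap the difference into (-pi, pi] and return its magnitude.
    return abs(math.atan2(math.sin(diff), math.cos(diff)))


# 40 Hz corresponds to a 25 ms period, so a quarter period is 6.25 ms.
short_delay_lag = final_phase_lag(0.004)   # delay below a quarter period
long_delay_lag = final_phase_lag(0.010)    # delay above a quarter period
```

With these illustrative parameters the pair settles near zero phase lag for
the 4 ms delay and near antiphase for the 10 ms delay. This is only meant to
make the quarter-period condition concrete; it says nothing about variable
conduction velocities or about the cortical architectures discussed in the text.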

However, the possibility of achieving synchrony through reciprocal cortical
connections does not exclude a contribution of common input to the
establishment of cortical synchronization. Especially if temporal patterns
of  responses  need  to  be  coordinated  across  distant  cortical  areas,  bifurcat¬ 
ing  corticocortical  projections  or  divergent  corticopetal  projections  from 
subcortical  structures  such  as  the  "nonspecific"  thalamic  nuclei,  the  basal 
ganglia,  and  the  nuclei  of  the  basal  forebrain  could  play  an  important  role. 
By  modulating  in  synchrony  the  excitability  of  selected  cortical  areas  they 
could  influence  very  effectively  the  probability  with  which  neurons  dis¬ 
tributed  across  these  selected  areas  engage  in  synchronous  firing.  A  contri¬ 
bution  of  diverging  cortical  backprojections  to  long-range  synchronization 
is  suggested  by  the  observation  that  unilateral  focal  inactivation  of  a  pre- 
striate  cortical  area  reduces  intraareal  and  interhemispheric  synchrony  in 
area  17  (Nelson  et  al.  1992b).  A  contribution  of  thalamic  mechanisms  to 
the  establishment  of  cortical  synchrony  has  yet  to  be  demonstrated. 

EXPERIENCE-DEPENDENT  MODIFICATIONS  OF  SYNCHRONIZING 
CONNECTIONS  AND  SYNCHRONIZATION  PROBABILITIES 

The  theory  of  assembly  coding  implies  that  the  criteria  according  to  which 
particular  features  are  grouped  together  reside  in  the  functional  architec¬ 
ture  of  the  assembly  forming  coupling  connections.  It  is  of  particular  inter¬ 
est,  therefore,  to  study  the  development  of  the  synchronizing  connections, 
to  identify  the  rules  according  to  which  they  are  selected,  to  establish  corre¬ 
lations  between  their  architecture  and  synchronization  probabilities,  and, 
if  possible,  to  relate  these  neuronal  properties  to  perceptual  functions. 

In mammals corticocortical connections develop mainly postnatally
(Innocenti 1981; Price and Blakemore 1985a; Luhmann et al. 1986; Callaway
and  Katz  1990)  and  attain  their  final  specificity  through  an  activity-depen¬ 
dent  selection  process  (Innocenti  and  Frost  1979;  Price  and  Blakemore 
1985b;  Luhmann  et  al.  1990;  Callaway  and  Katz  1991).  Recent  results  from 
strabismic  kittens  indicate  that  this  selection  is  based  on  a  correlation  rule 
and  leads  to  disruption  of  connections  between  cells  which  often  exhibit 
decorrelated  activity  (Lowel  and  Singer  1992).  Raising  kittens  with  artifi¬ 
cially  induced  strabismus  leads  to  changes  in  the  connections  between  the 
two eyes and cortical cells so that individual cortical neurons become
connected to only one eye (Hubel and Wiesel 1965a). Cortical neurons split into
two  subpopulations  of  about  equal  size,  each  responding  rather  selectively 
to  stimulation  of  one  eye  only.  Because  of  the  misalignment  of  the  two  eyes 
it is also to be expected that there are no consistent correlations between
the  activation  patterns  of  neurons  driven  by  the  two  eyes.  Recently,  it  has 
been  found  that  strabismus,  when  induced  in  3-week-old  kittens,  leads  to 
a  profound  rearrangement  of  corticocortical  connections.  Normally,  these 
connections  link  cortical  territories  irrespective  of  whether  these  are  dom¬ 
inated  by  the  same  or  by  different  eyes.  In  the  strabismics,  by  contrast,  the 
tangential  intracortical  connections  come  to  link  with  high  selectivity  only 
territories  served  by  the  same  eye.  These  anatomical  changes  in  the  archi¬ 
tecture  of  corticocortical  connections  are  reflected  by  altered  synchroniza¬ 
tion  probabilities.  In  strabismic  cats  response  synchronization  no  longer 
occurs  between  cell  groups  connected  to  different  eyes  while  it  is  normal 
between  cell  groups  connected  to  the  same  eye  (Konig  et  al.  1990, 1993). 

These  results  have  several  important  implications.  First,  they  are  com¬ 
patible  with  the  notion  that  tangential  intracortical  connections  contribute 
to  response  synchronization  (see  above).  However,  as  strabismus  also  abol¬ 
ishes  convergence  of  projections  from  the  two  eyes  onto  common  cortical 
target  cells,  this  result  is  also  compatible  with  the  view  that  synchrony  is 
caused by common input. Second, these results agree with the postulates
of the assembly hypothesis that the assembly-forming connections should
be susceptible to use-dependent modifications and be selected according
to a correlation rule. Third, the modifications of intracortical connections
and synchronization probabilities add to the list of substrate changes that
may be related to the specific perceptual deficits associated with early
onset squint. Strabismic subjects usually develop normal monocular vision
in  both  eyes,  but  they  become  unable  to  fuse  signals  conveyed  by  different 
eyes  into  coherent  percepts  even  if  these  signals  are  made  retinotopically 
contiguous  by  optical  compensation  of  the  squint  angle  (von  Noorden 
1990).  Thus,  in  strabismics,  binding  mechanisms  appear  to  be  abnormal  or 
missing  between  cells  driven  from  different  eyes.  The  lack  of  corticocorti¬ 
cal  connections  and  the  lack  of  response  synchronization  could  be  among 
the  reasons  for  this  deficit  in  addition  to  the  loss  of  binocular  neurons. 

These  findings  are,  at  least,  compatible  with  the  view  that  the  architecture 
of  corticocortical  connections,  by  determining  the  probability  of  response 
synchronization,  could  set  the  criteria  for  perceptual  grouping.  Since  this 
architecture  is  shaped  by  experience,  this  opens  up  the  possibility  that 
some  of  the  binding  and  segmentation  criteria  are  acquired  or  modified  by 
experience. 

CORRELATION  BETWEEN  PERCEPTUAL  DEFICITS  AND  RESPONSE 
SYNCHRONIZATION  IN  STRABISMIC  AMBLYOPIA 

Further  indications  for  a  relation  between  experience-dependent  modifi¬ 
cations  of  synchronization  probabilities  and  functional  deficits  come  from 
a  recent  study  of  strabismic  cats  who  had  developed  amblyopia.  Strabis¬ 
mus,  when  induced  early  in  life,  does  not  only  lead  to  a  loss  of  binocular 
fusion  and  stereopsis  but  may  also  lead  to  amblyopia  of  one  eye  (von 
Noorden 1990). This condition develops when the subjects solve the
problem of double vision not by alternating use of the two eyes but by
constantly suppressing the signals coming from the deviated eye. The
amblyopic deficit usually consists of reduced spatial resolution and distorted and
blurred  perception  of  patterns.  A  particularly  characteristic  phenomenon 
in  amblyopia  is  crowding,  the  drastic  impairment  of  the  ability  to  discrim¬ 
inate  and  recognize  figures  if  these  are  surrounded  with  other  contours. 
The  identification  of  neuronal  correlates  of  these  deficits  in  animal  models 
of  amblyopia  has  remained  inconclusive  because  the  contrast  sensitivity 
and  the  spatial  resolution  capacity  of  neurons  in  the  retina  and  the  lateral 
geniculate  nucleus  were  found  normal.  In  the  visual  cortex  identification  of 
neurons  with  reduced  spatial  resolution  or  otherwise  abnormal  receptive 
field  properties  remained  controversial  (for  a  discussion  see  Crewther  and 
Crewther  1990;  Blakemore  and  Vital  Durand  1992).  However,  multielec¬ 
trode  recordings  from  striate  cortex  of  cats  exhibiting  behaviorally  verified 
amblyopia  have  revealed  highly  significant  differences  in  the  synchroniza¬ 
tion  behavior  of  cells  driven  by  the  normal  and  the  amblyopic  eye,  respec¬ 
tively.  The  responses  to  single  moving  bars  that  were  recorded  simultane¬ 
ously  from  spatially  segregated  neurons  connected  to  the  amblyopic  eye 
were  much  less  well  synchronized  with  one  another  than  the  responses 
recorded  from  neuron  pairs  driven  through  the  normal  eye  (Roelfsema  et 
al.  1993).  This  difference  was  even  more  pronounced  for  responses  elicited 
by  gratings  of  different  spatial  frequency.  For  responses  of  cell  pairs  ac¬ 
tivated  through  the  normal  eye  the  strength  of  synchronization  tended 
to  increase  with  increasing  spatial  frequency  while  it  tended  to  decrease 
further  for  cell  pairs  activated  through  the  amblyopic  eye.  Apart  from 
these  highly  significant  differences  between  the  synchronization  behavior 
of  cells  driven  through  the  normal  and  the  amblyopic  eye  no  other  dif¬ 
ferences  were  found  in  the  commonly  determined  response  properties  of 
these  cells.  Thus,  cells  connected  to  the  amblyopic  eye  continued  to  re¬ 
spond  vigorously  to  gratings  whose  spatial  frequency  had  been  too  high 
to  be  discriminated  with  the  amblyopic  eye  in  the  preceding  behavioral 
tests  (see  figure  10.4).  These  results  suggest  that  disturbed  temporal  co¬ 
ordination  of  responses  such  as  reduced  synchrony  may  be  one  of  the 
neuronal  correlates  of  the  amblyopic  deficit.  Indeed,  if  synchronization 
of  responses  at  a  millisecond  time  scale  is  used  by  the  system  for  feature 
binding  and  perceptual  grouping,  disturbance  of  this  temporal  patterning 
could  be  the  cause  for  the  crowding  phenomenon,  as  this  can  be  regarded 
as  a  consequence  of  impaired  perceptual  grouping. 

THE  RELATIONSHIP  BETWEEN  SYNCHRONY  AND  OSCILLATIONS 

Before  reviewing  the  evidence  on  context-dependent  synchronization 
across  spatially  segregated  groups  of  neurons  in  structures  other  than  the 
visual  cortex  it  is  necessary  to  examine  the  relationship  between  response 
synchronization on the one hand and oscillatory responses on the other,

Figure 10.4 Amplitudes and synchronization of responses to gratings of different spatial
frequencies recorded from cats with strabismic amblyopia. (A-D) Responses to low (A and B)
and high (C and D) spatial frequency gratings, recorded simultaneously from two cell groups
driven by the normal eye (N-sites) (A and C) and two cell groups driven by the amblyopic
eye (A-sites) (B and D), respectively. The left and right panels show the response histograms
and the corresponding cross-correlograms, respectively. Note that response amplitudes
decrease at the higher spatial frequency in both cases, while the relative modulation amplitude
increases for the N-N pair but decreases for the A-A pair. (E) Cumulative distribution
functions of the differences between the amplitude of responses to low and high spatial frequency
gratings of optimal orientation. N-sites, squares (n = 53); A-sites, triangles (n = 35). Abscissa,
responses to high spatial frequency minus responses to low spatial frequency gratings. Note
the similarity of the two distributions (P > 0.1). (F) Cumulative distribution functions of the
differences between relative modulation amplitudes (DRMA) of cross-correlograms obtained
for responses to high and low spatial frequency gratings of N-N pairs (squares, n = 24) and A-A
pairs (triangles, n = 11). DRMA values (abscissa) were calculated by subtracting the relative
modulation amplitude obtained with the low spatial frequency from that obtained with the
high spatial frequency. The difference between the DRMA distributions of N-N pairs and
A-A pairs is highly significant. (From Roelfsema et al. 1993)

as the two phenomena are often but not necessarily always related. The
occurrence  of  oscillatory  responses  does  not  logically  imply  that  cells  dis¬ 
charge  in  synchrony.  Likewise,  the  nonoccurrence  of  oscillations  does  not 
exclude  synchrony.  Furthermore,  it  is  useful  to  analyze  to  what  extent  dif¬ 
ferent  recording  methods  are  appropriate  for  the  assessment  of  synchrony 
or  oscillatory  behavior  because  there  are  a  number  of  difficulties  with  the 
detectability  of  oscillatory  firing  patterns  in  single  cell  recordings  and  with 
the  definition  of  oscillations. 

No  inferences  can  of  course  be  drawn  from  single  cell  recordings  as  to 
whether  the  responses  of  the  recorded  cell  are  synchronized  with  others 
irrespective of whether the recorded cell is found to discharge in an
oscillatory manner. The situation is different when multiunit recordings are
obtained with a single electrode. In this case periodically modulated
autocorrelograms are always indicative not only of oscillatory firing patterns but
also  of  response  synchronization,  at  least  among  the  local  group  of  simul¬ 
taneously  recorded  neurons.  The  reason  is  that  such  periodic  modulations 
can  build  up  only  if  a  sufficient  number  of  the  simultaneously  recorded 
cells  are  oscillating  synchronously  and  at  a  sufficiently  regular  rhythm. 
However,  not  observing  periodically  modulated  autocorrelograms  of  mul¬ 
tiunit  recordings  neither  excludes  that  the  recorded  units  oscillate,  because 
nonsynchronized  oscillations  would  not  be  observable,  nor  excludes  that 
the  recorded  cells  actually  fire  in  synchrony,  because  they  could  do  so  in  a 
nonperiodic  way.  The  same  arguments  are  applicable  to  field  potential  and 
even  more  so  to  EEG  recordings.  If  they  exhibit  an  oscillatory  pattern  this 
always  implies  that  a  large  number  of  neurons  must  have  engaged  in  syn¬ 
chronized  rhythmic  activity  because  otherwise  the  weak  fields  generated 
by  activation  of  individual  synapses  and  neurons  would  not  sum  to  poten¬ 
tials  recordable  with  macroelectrodes.  But  again,  the  reverse  is  not  true: 
neither  oscillatory  discharge  patterns  nor  response  synchronization  can  be 
excluded  if  macroelectrode  recordings  fail  to  reveal  oscillatory  fluctuations. 
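
In its simplest form, a correlogram of the kind discussed here is a histogram
of the time differences between the spikes of two trains. The Python sketch
below shows that bare counting step only; it is an assumption-based
illustration and omits everything a real analysis would add (normalization,
shift predictors, curve fitting). The function name, the bin width, and the
toy spike times are hypothetical. The toy data are chosen to mirror the point
made above that cells can fire in synchrony in a nonperiodic way: the two
trains share the same irregular spike times up to a small jitter.

```python
def cross_correlogram(spikes_a, spikes_b, max_lag_ms=50.0, bin_ms=1.0):
    """Count spike-time differences (b minus a) per lag bin.

    Spike times are in milliseconds. Returns (lags, counts), where lags
    holds the bin centres and counts the raw coincidence counts."""
    half = int(round(max_lag_ms / bin_ms))      # bins on each side of zero
    counts = [0] * (2 * half + 1)
    for ta in spikes_a:
        for tb in spikes_b:
            k = int(round((tb - ta) / bin_ms))  # lag expressed in bins
            if -half <= k <= half:
                counts[k + half] += 1
    lags = [(i - half) * bin_ms for i in range(2 * half + 1)]
    return lags, counts


# Invented, irregular (aperiodic) spike times in ms.
train_a = [3.0, 17.0, 22.0, 48.0, 61.0, 95.0, 130.0, 142.0, 180.0]
# Second train fires with the first, apart from a 0.2 ms offset.
train_b = [t + 0.2 for t in train_a]

lags, counts = cross_correlogram(train_a, train_b)
centre = len(counts) // 2                       # index of the zero-lag bin
```

For these toy trains every spike pair coincides within the zero-lag bin, so
`counts[centre]` equals the number of spikes, while the other bins hold only
scattered counts and no regularly spaced side peaks. Passing the same train
twice would give an autocorrelogram, with the caveat that the zero-lag bin then
also counts each spike paired with itself.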

Furthermore,  it  needs  to  be  considered  that  single  cell  recordings  may 
not  be  particularly  well  suited  for  the  diagnosis  of  oscillatory  activity.  This 
is  suggested  by  results  from  the  visual  cortex  (Gray  et  al.  1990)  and  in 
particular  from  the  olfactory  bulb  (Freeman  and  Skarda  1985).  Individual 
discharges  of  single  units  may  be  precisely  time-locked  with  the  oscillating 
field  potential,  which  proves  that  these  discharges  participated  in  an  oscil¬ 
latory  process  and  occurred  in  synchrony  with  those  of  many  other  cells, 
without,  however,  showing  any  sign  of  oscillatory  activity  in  their  auto¬ 
correlation  function.  The  reasons  for  this  apparent  paradox  are  sampling 
problems  and  nonstationarity  of  the  time  series.  If  the  single  cell  does  not 
discharge  at  every  cycle  and  if  the  oscillation  frequency  is  not  perfectly  con¬ 
stant  over  a  period  of  time  sufficiently  long  to  sample  enough  discharges  for 
an  interpretable  autocorrelation  function,  the  oscillatory  rhythm  to  which 
the  cell  is  actually  locked  will  not  be  disclosable.  Thus,  the  less  active  a  cell 
and  the  higher  and  more  variable  the  oscillation  frequency,  the  less  legiti¬ 
mate  it  is  to  infer  from  nonperiodically  modulated  autocorrelograms  that  a 
cell is not oscillating. This sampling problem becomes more and more
accentuated as the frequency of the oscillations increases. This explains why
γ-band oscillations have been observed first with macroelectrodes and
remain difficult to observe with microelectrodes unless one can record from
several synchronously active cells simultaneously. Finally, some
ambiguities are associated with the term "oscillations." Most commonly,
oscillations are associated with periodic time series such as are produced by a
pendulum or a harmonic oscillator. But there are also more irregular or even
aperiodic  time  series  which  are  still  called  oscillatory.  Such  irregular  oscil¬ 
lations  typically  occur  in  noisy  linear  or  in  nonlinear  systems  and  cover  a 
large  spectrum  of  phenotypes  from  slightly  distorted,  periodic  oscillations 
to  chaotic  oscillations  to  nearly  stochastic  time  series.  Oscillatory  phenom¬ 
ena  in  the  brain  are  rarely  of  the  harmonic  type  and  if  so  only  over  very  short 
time  intervals.  Most  often,  oscillatory  activity  in  the  brain  is  so  irregular 
that  autocorrelation  functions  computed  over  prolonged  periods  of  time 
frequently  fail  to  reveal  the  oscillatory  nature  of  the  underlying  time  series. 

These  considerations  need  to  be  taken  into  account  for  the  interpretation 
of  the  data  reviewed  below  as  these  have  been  obtained  with  very  different 
methods  and  often  also  different  goals. 

The  evidence  that  in  most  structures  investigated  phases  of  response 
synchronization  tend  to  be  associated  with  episodes  of  oscillatory  activity 
raises  the  question  as  to  whether  oscillations  and  synchrony  are  causally 
related. 

One  possibility  is  that  oscillatory  activity  favors  the  establishment  of 
synchrony  and  hence  is  instrumental  for  response  synchronization.  In 
oscillatory  responses,  the  occurrence  of  a  discharge  predicts  with  some 
probability  the  occurrence  of  the  next.  It  has  been  argued  that  this  pre¬ 
dictability  is  a  necessary  prerequisite  to  synchronize  remote  cell  groups 
with  zero  phase  lag,  despite  considerable  conduction  delays  in  the  cou¬ 
pling  connections  (for  review  see  Engel  et  al.  1992).  This  view  is  supported 
by  simulation  studies  that  have  shown  that  zero-phase  lag  synchroniza¬ 
tion  can  be  achieved  despite  considerable  conduction  delays  and  variation 
of  conduction  times  in  the  synchronizing  connections  if  the  coupled  cell 
groups  have  a  tendency  to  oscillate  (Schillen  and  Konig  1990, 1992;  Schus¬ 
ter  and  Wagner  1990a,b;  Konig  and  Schillen  1991).  Another  feature  of 
networks  with  oscillatory  properties  is  that  network  elements  that  are  not 
linked  directly  can  be  synchronized  via  intermediate  oscillators  (Konig 
and  Schillen  1991).  This  may  be  important,  for  instance,  to  establish  rela¬ 
tionships  between  remote  cell  groups  within  the  same  cortical  area,  or  for 
cells  distributed  across  cortical  areas  that  process  different  sensory  modal¬ 
ities.  In  both  cases,  linkages  either  via  intermediate  cortical  relays  or  even 
via subcortical centers must be considered. The latter possibility is
supported by the occurrence of γ-oscillations in a variety of thalamic nuclei
(see  above).  These  considerations  suggest  that  oscillations,  while  not  con¬ 
veying  any  stimulus-specific  information  per  se,  may  be  instrumental  for 
the establishment of synchrony over large distances. This conjecture is
supported by the evidence that synchronization over larger distances within
an  area  (>  2  mm)  or  between  areas  is  in  most  cases  associated  with  an  os¬ 
cillatory  patterning  of  the  local  discharges  (Konig  et  al.  1992a).  However, 
it  is  also  conceivable  that  oscillations  occur  as  a  consequence  of  synchrony. 
Simulation  studies  indicate  that  networks  with  excitatory  and  inhibitory 
feedback  have  the  tendency  to  converge  toward  states  where  discharges  of 
local cell clusters become synchronous (Sporns et al. 1991; Koch and
Schuster 1992; Deppisch et al. 1992). Once such a synchronous volley has been
generated,  the  network  is  likely  to  engage  in  oscillatory  activity.  Because  of 
recurrent  inhibition  and  because  of  Ca2+-activated  K+  conductances  (Llinas 
1988a,  1990b),  the  cells  that  had  emitted  a  synchronous  discharge  will  also 
become  simultaneously  silent.  On  fading  of  these  inhibitory  events,  fir¬ 
ing  probability  will  increase  simultaneously  for  all  cells  and  this,  together 
with  maintained  excitatory  input  and  nonlinear  voltage-gated  membrane 
conductances such as the low threshold Ca2+ channels (Llinas 1990b), will
favor  the  occurrence  of  the  next  synchronous  burst,  and  so  on.  Thus,  os¬ 
cillations  are  a  likely  consequence  of  synchrony  and  it  actually  becomes 
an  important  issue  to  understand  how  cortical  networks  can  be  prevented 
from  entering  states  of  global  oscillations  and,  if  they  do,  how  these  can 
be  terminated  (see,  e.g.,  Freeman  and  Skarda  1985).  These  issues  have 
recently  been  addressed  in  a  number  of  simulation  studies  (Hansel  and 
Sompolinsky  1992;  König  and  Schillen  1991;  König  et  al.  1992;  Schillen
and  König  1991,  1993;  Sporns  et  al.  1991).

EVIDENCE  FOR  RESPONSE  SYNCHRONIZATION  IN  NONVISUAL 
STRUCTURES 

Most  experiments  have  concentrated  on  the  analysis  of  oscillatory  activity,
but  those  that  used  multiunit  or  field  potential  methods  not  only  provide
information  on  the  occurrence  of  oscillations  but  also  allow  one  to  make
inferences  about  response  synchronization.  Only  a  few  studies  are  available
that  explicitly  address  the  question  of  response  synchronization  with
multielectrode  recordings  in  structures  other  than  the  visual  cortex.  Such  data
are  now  available  for  the  somatosensory  and  motor  cortex  (Murthy  and 
Fetz  1992),  the  acoustic  and  the  frontal  cortex  (Aertsen  et  al.  1991;  Vaadia  et 
al.  1991;  Ahissar  et  al.  1992),  and  the  pigeon  optic  tectum  (Neuenschwander 
and  Varela  1993).  In  every  case  evidence  has  been  obtained  for  transient 
interactions  between  simultaneously  recorded  neurons.  As  in  the  visual 
cortex  these  episodes  of  manifest  interactions  were  usually  of  short  dura¬ 
tion.  In  the  acoustic  and  the  somatomotor  cortex  as  well  as  in  the  optic 
tectum  the  interactions  resembled  those  in  the  visual  cortex  (i.e.,  the  cells 
synchronized  their  responses  with  zero  phase  lag).  In  the  sensorimotor 
cortex  synchronization  has  been  tested  between  different  cortical  areas 
and  found  between  the  arm  representations  in  the  somatosensory  and  the 
motor  cortex  and  even  between  the  motor  cortices  of  the  two  hemispheres. 
In  the  other  areas,  only  within-area  correlations  were  sought.  Some
indications  are  available  that  the  observed  episodes  of  coupled  discharges  and
synchrony  are  correlated  with  behavior.  Synchronization  between  units 
in  somatosensory  and  motor  cortex  is  particularly  pronounced  while  the 
monkey  tries  to  solve  a  difficult  reaching  task  but  vanishes  once  the  task 
is  learned  and  reaching  is  executed  without  difficulty  (Murthy  and  Fetz
1992).  Synchronization  between  units  in  the  frontal  cortex  has  been
reported  to  occur  in  contiguity  with  certain  behavioral  sequences  in  a
complex  delayed  matching-to-sample  task  (Aertsen  et  al.  1991).  And,  most
importantly,  a  recent  study  showed  in  the  acoustic  cortex  of  awake  mon¬ 
keys  that  learning  a  stimulus-stimulus  association  transiently  increases  the 
coupling  between  nearby  cells  responsive  to  the  two  stimuli  with  the  effect 
that  their  responses  become  more  synchronous  (Ahissar  et  al.  1992). 

EEG  and  field  potential  recordings  from  humans  and  higher  mammals 
have  provided  abundant  evidence  for  γ-band  oscillations  in  a  variety  of
nonvisual  cortical  and  subcortical  structures  (for  further  review  of  the
extensive  literature  see  Basar  1980;  Basar  and  Bullock  1992;  Gray  1993;  Singer
1993).  In  the  mammalian  olfactory  system,  for  example,  40-80  Hz  oscillatory
activity  is  evoked  during  the  inspiratory  phase  in  both  the  olfactory
bulb  and  piriform  cortex  (Adrian  1942;  Freeman  1975).  This  activity  is 
synchronous  over  a  scale  of  several  millimeters  both  within  and  between
the  two  structures  (Freeman  1975;  Bressler  1984, 1987;  Freeman  and  Skarda 
1985).  The  patterns  of  activity  that  emerge  during  these  coherent  states  cor¬ 
respond  to  specific  odors,  the  animal's  past  experience  with  the  odors  and 
their  behavioral  significance  (Freeman  and  Skarda  1985).  The  oscillatory 
activity  in  itself  is  not  thought  to  convey  any  specific  information;  rather,
it  is  viewed  as  a  basic  neuronal  mechanism  for  establishing  synchrony 
among  large  populations  of  coactive  cells.  As  discussed  above,  these  data 
can  be  taken  as  evidence  for  the  occurrence  of  synchrony,  at  least  among 
the  local  cluster  of  neurons  contributing  to  the  electrical  field  that  is  picked 
up  by  the  macroelectrode. 

Similar  field  oscillations  have  been  recorded  from  a  variety  of  neocortical 
areas  in  cats  and  monkeys.  This  activity  was  particularly  prominent  when 
the  subjects  were  aroused  and  in  a  state  of  focused  attention  (Dumenko 
1961;  Rougeul  et  al.  1979;  Spydell  et  al.  1979;  Bouyer  et  al.  1981;  Montaron 
et  al.  1982;  Sheer  1984, 1989;  Ribary  et  al.  1991;  Gaal  et  al.  1992;  Murthy  and 
Fetz  1992;  Tiitinen  et  al.  1993).  These  rhythmic  activities  are  synchronous 
over  relatively  large  areas  of  cortex  (Bouyer  et  al.  1981).  In  the
sensorimotor  cortex  they  occur  in  phase  with  similar  activity  in  the
ventrobasal  thalamus  (Bouyer  et  al.  1981)  and  are  regulated  by  dopaminergic
input  from  the  ventral  tegmentum  (Montaron  et  al.  1982).  Oscillatory  field 
potential  and  unit  activities  in  the  range  of  20-40  Hz  have  been  observed  in 
the  motor  cortex  of  alert  monkeys  (Gaal  et  al.  1992;  Murthy  and  Fetz  1992). 
These  signals  are  synchronous  over  widespread  areas  of  the  motor  cortical
map  within  and  between  the  two  cerebral  hemispheres  and  between  the
motor  and  somatosensory  cortices,  and  they  are  enhanced  in  amplitude  when
the  animals  are  preparing  new  and  complicated  motor  acts  (Murthy  and
Fetz  1992).  The  rhythms  are  suppressed,  however,  during  the  execution  of
trained  movements  (Donoghue  and  Sanes  1991). 

Oscillatory  components  in  the  β-  and  γ-frequency  ranges  have  also  been
extensively  documented  in  humans.  In  several  early  studies  depth  record¬ 
ings  of  the  local  EEG  from  a  number  of  cortical  and  subcortical  sites  re¬ 
vealed  pronounced  episodes  of  synchronous  rhythmic  activity  in  relation 
to  particular  behavioral  states  (Sem-Jacobsen  et  al.  1956;  Chatrian  et  al. 
1960;  Perez-Borja  et  al.  1961).  Surface  EEG  and  MEG  recordings  have
revealed  γ-frequency  components  in  the  auditory  evoked  potential
(Galambos  et  al.  1981;  Galambos  and  Makeig  1988;  Basar  1988;  Pantev  et  al.  1991)
and  have  shown  them  to  be  broadly  distributed  over  the  entire  cerebral  mantle  (Ribary  et
al.  1991).  During  a  number  of  different  behavioral  states  the  hippocampus 
is  known  to  exhibit  some  of  the  most  robust  forms  of  synchronous  rhyth¬ 
mic  activity  to  be  observed  in  the  central  nervous  system.  Foremost  among 
these  is  the  theta-rhythm,  a  sinusoidal-like  oscillation  of  neuronal  activ¬ 
ity  at  4-10  Hz  that  occurs  during  active  movement  and  alert  immobility. 
Theta-field  potentials  are  often  synchronous  between  the  two  hemispheres 
and  over  distances  extending  up  to  8  mm  along  the  longitudinal  axis  of 
the  hippocampus  (Bland  et  al.  1975).  Local  populations  of  cells  also  ex¬ 
hibit  a  high  degree  of  synchronous  firing  during  theta-activity  (Kuperstein 
et  al.  1986).  In  addition,  two  other  hippocampal  neuronal  rhythms  have 
been  discovered,  one  having  a  frequency  near  40  Hz  that  occurs  during  a 
variety  of  behavioral  states  (Buzsaki  et  al.  1983)  and  is  both  locally  and  bi¬ 
laterally  synchronous.  Another  more  recently  discovered  signal  having  a 
frequency  around  200  Hz  is  associated  with  alert  immobility  and  the  pres¬ 
ence  of  sharp  waves  in  the  hippocampal  EEG  (Buzsaki  et  al.  1992).  These 
events  have  been  termed  population  oscillations  because  single  cells  do 
not  exhibit  high-frequency  periodic  firing.  Rather,  they  fire  at  a  low  rate  in
synchrony  with  the  surrounding  population  of  cells,  which  in  the  com¬ 
posite  yields  a  periodic  structure  that  is  synchronous  over  distances  up  to 
1.2  mm  (Buzsaki  et  al.  1992). 
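
The notion of a population oscillation can be illustrated with a toy simulation. The sketch below is not taken from Buzsaki et al. (1992); the cell count, the participation probability per cycle, and the absence of timing jitter are all assumptions chosen for illustration. Each simulated cell joins only about one cycle in twenty of a 200 Hz rhythm, so its own rate stays near 10 Hz, yet the summed activity is strongly periodic at the 5-msec cycle length.

```python
import random


def simulate_population(n_cells=200, freq_hz=200, p_fire=0.05,
                        duration_ms=1000, seed=0):
    """Return one 0/1 spike train per cell, sampled in 1-ms bins.

    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    period_ms = 1000 // freq_hz  # 5 ms for a 200 Hz rhythm
    trains = [[0] * duration_ms for _ in range(n_cells)]
    for cycle_start in range(0, duration_ms, period_ms):
        for train in trains:
            # Each cell participates in a given cycle only rarely.
            if rng.random() < p_fire:
                train[cycle_start] = 1
    return trains


def population_count(trains):
    """Sum the spikes of all cells in each time bin."""
    return [sum(bin_values) for bin_values in zip(*trains)]


def autocorrelation(signal, lag):
    """Normalized autocorrelation of a sequence at a positive lag (in bins)."""
    mean = sum(signal) / len(signal)
    centered = [value - mean for value in signal]
    numerator = sum(centered[i] * centered[i + lag]
                    for i in range(len(centered) - lag))
    denominator = sum(value * value for value in centered)
    return numerator / denominator


trains = simulate_population()
population = population_count(trains)
rates_hz = [sum(train) for train in trains]  # spikes in 1 s equals rate in Hz
print("highest single-cell rate (Hz):", max(rates_hz))
print("population autocorrelation at 5 ms:",
      round(autocorrelation(population, 5), 2))
print("population autocorrelation at 2 ms:",
      round(autocorrelation(population, 2), 2))
```

In this sketch the population autocorrelation is high at a lag of one cycle and negative in between, although no single cell fires at anything close to 200 Hz.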

There  is  thus  ample  evidence  from  brain  structures  other  than  the  visual 
cortex  that  groups  of  cells  engage  in  synchronous  rhythmic  activity  in  the 
γ-frequency  range.  The  fact  that  this  activity  occurs  in  the  awake  brain
and  increases  with  attention  and  preparation  of  motor  acts  suggests  that
it  is  functionally  relevant.  An  early  Russian  study  described  phase
coherence  for  intracortically  recorded  field  oscillations  in  the  γ-band  between
visual,  acoustic,  and  motor  cortex  of  dogs.  This  coherence  developed  in  a 
highly  selective  way  in  the  course  of  Pavlovian  conditioning  while  the  dog 
acquired  sensorimotor  reactions  to  visual  or  acoustic  stimuli  (Dumenko 
1961).  When  the  dog  was  conditioned  to  withdraw  the  front  paw  after  a 
visual  warning  stimulus,  coherent  oscillations  became  manifest  over  the 
motor  representation  of  the  front  paw  and  the  visual  cortex.  When  the 
withdrawal  reflex  was  shifted  to  the  hind  paw  and  the  warning  stimulus 
to  a  tone,  coherent  oscillations  appeared  over  the  acoustic  cortex  and  the 
hind-paw  representation. 

EVIDENCE  FOR  β-  AND  γ-BAND  OSCILLATIONS  IN  UNIT
RECORDINGS  FROM  VISUAL  AND  NONVISUAL  STRUCTURES

In  contrast  to  the  numerous  field  potential  studies  that  have  disclosed  the 
presence  of  β-  and  γ-oscillations  in  many  different  cortical  and  subcortical
areas,  single  unit  analyses  designed  specifically  to  search  for  such
oscillatory  discharge  patterns  have  often  failed  to  confirm  their  presence  or
have  led  to  controversial  results.  So  far,  all  investigators  agree  that
oscillating  unit  activity  in  the  γ-range  occurs  in  the  primary  visual  cortex  of  cats
and  monkeys,  whether  anesthetized  or  awake  (Gray  and  Singer  1987,  1989;
Eckhorn  et  al.  1988,  1992;  Raether  et  al.  1989;  Ghose  and  Freeman  1990,  1992;
Livingston  1991;  Schwarz  and  Bolz  1991;  Jagadeesh  et  al.  1992;  Gray  and
Viana  di  Prisco  1993),  in  cat  area  18  (Eckhorn  et  al.  1988,  1992;  Nelson  et  al.
1992a),  and  in  area  PMLS  of  cat  visual  cortex  (Engel  et  al.  1991a).  For  area 
MT(V5)  of  the  visual  cortex  of  awake  monkeys  one  positive  report  (Kreiter 
and  Singer  1992)  stands  against  two  negative  findings  (Young  et  al.  1992; 
Bair  et  al.  1992).  No  evidence  was  found  in  temporal  visual  areas  of  the 
monkey  (Tovee  and  Rolls  1992)  but  Nakamura  et  al.  (1992)  observed  both 
low-  and  high-frequency  oscillations  associated  with  a  recognition  task  in 
the  temporal  pole  of  Macaca  mulatta.  High-frequency  oscillations  in  single 
cell  activity  have  also  been  observed  in  somatosensory  cortex  where  they 
were  suppressed  during  sensory  stimulation  (Ahissar  and  Vaadia  1990), 
in  the  frontal  cortex  where  they  occurred  in  relation  to  preparatory  phases 
of  motion  (Murthy  and  Fetz  1992;  Donoghue  and  Sanes  1991;  Gaal  et  al. 
1992),  and  in  prefrontal  cortex  where  they  were  associated  with  particular 
behavioral  sequences  (Aertsen  et  al.  1991).  Finally,  synchronous  multiu¬ 
nit  responses  very  similar  to  those  occurring  in  the  cat  visual  cortex  have 
been  observed  in  the  optic  tectum  of  awake  pigeons  (Neuenschwander 
and  Varela  1990).  Thus,  single  unit  data  from  a  number  of  different  labs 
agree  with  the  evidence  from  field  potential  and  EEG  recordings  that
oscillatory  activity  in  the  γ-range  is  a  common  phenomenon  in  certain  brain
structures,  but  there  are  also  several  negative  findings  that  challenge  this
view.  A  possible  reason  for  this  discrepancy  between  single  cell  and  field 
potential  data  is  that  single  cell  recordings  are  not  well  adapted  to  disclose 
the  participation  of  a  cell  in  an  oscillatory  process  if  oscillation  frequency 
is  high  and  irregular  and  the  cell's  discharge  rate  low  (see  above). 

THE  DURATION  OF  COHERENT  STATES 

It  has  been  argued  that  synchronous  oscillatory  activity  is  unlikely  to  serve 
a  function  in  visual  processing  because  the  time  required  to  establish  and 
to  evaluate  synchrony  would  be  incompatible  with  the  short  recognition 
times  common  in  visual  perception  (Tovee  and  Rolls  1992).  The  following 
considerations  suggest  that  time  constraints  may  not  be  that  critical  even 
if  synchrony  is  used  as  a  code.  In  a  study  by  Gray  et  al.  (1992a)  recordings 
of  field  potential  and  unit  activity  were  performed  at  two  sites  in  cat  visual 
cortex  having  a  separation  of  at  least  4  mm.  Field  potential  responses  were
chosen  for  analysis  in  which  the  signals  displayed  particularly  robust  oscil¬ 
lations,  a  close  correlation  to  the  simultaneously  recorded  unit  activity,  and 
a  statistically  significant  average  cross-correlation.  Under  these  conditions 
it  became  possible  to  determine  (1)  the  onset  latency  of  the  synchronous 
activity;  (2)  the  time-dependent  changes  in  phase,  frequency,  and  duration 
of  the  synchronous  episodes  within  individual  trials;  and  (3)  the  intertrial 
variation  in  each  of  these  parameters. 

The  results,  combined  with  previous  observations  (Engel  et  al.  1990), 
demonstrated  that  correlated  responses  in  cat  visual  cortex  exhibit  a  high 
degree  of  dynamic  variability.  The  amplitude,  frequency,  and  phase  of  the 
synchronous  events  vary  over  time.  The  onset  of  synchrony  is  variable  and 
bears  no  fixed  relation  to  the  stimulus.  Multiple  epochs  of  synchrony  can 
occur  on  individual  trials  and  the  duration  of  these  events  also  fluctuates 
from  one  stimulus  presentation  to  the  next.  Most  importantly,  the  results 
demonstrated  that  response  synchronization  can  be  established  within
50-100  msec,  a  time  scale  consistent  with  behavioral  performance  on  visual
discrimination  tasks  (Gray  et  al.  1992a). 

Similar,  rapid  fluctuations  between  synchronous  and  asynchronous  states 
have  been  observed  in  other  systems,  and  recent  methodological  develop¬ 
ments  have  made  a  quantitative  assessment  of  these  rapid  changes  possi¬ 
ble.  Using  the  joint-PSTH  (Aertsen  et  al.  1989)  and  gravitational  clustering 
algorithms  (Gerstein  et  al.  1985;  Gerstein  and  Aertsen  1985)  it  has  been 
possible  to  examine  the  time  course  of  correlated  firing  among  pairs  and 
larger  groups  of  neurons,  respectively  (Aertsen  et  al.  1991).  These  findings 
clearly  indicate  that  the  formation  of  coherently  active  cell  assemblies  is  a 
dynamic  process.  Patterns  of  synchronous  firing  can  emerge  from  seem¬ 
ingly  nonorganized  activity  within  tens  of  milliseconds,  and  can  change 
as  a  function  of  stimulus  and  task  conditions  within  similarly  short  time 
intervals.  These  findings  suggest  that  the  temporal  constraint  imposed  by 
perceptual  performance  can  be  met  by  the  dynamic  processes  that  underlie 
the  organization  of  synchronously  active  cell  assemblies. 
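
The basic measurement behind such analyses can be sketched with an ordinary cross-correlogram. The code below is not the joint-PSTH or the gravitational clustering algorithm cited above; it is the simpler time-averaged correlogram that those methods refine, applied to two synthetic spike trains. The firing probabilities and the function names are assumptions made for this illustration. The two trains share a small number of common events on top of independent background firing, and the correlogram shows the resulting peak at zero lag.

```python
import random


def make_pair(duration_ms=5000, p_shared=0.02, p_private=0.02, seed=1):
    """Two 0/1 spike trains (1-ms bins) with shared and private spikes.

    Probabilities are per bin and are illustrative assumptions.
    """
    rng = random.Random(seed)
    train_a, train_b = [], []
    for _ in range(duration_ms):
        shared = rng.random() < p_shared  # event seen by both cells
        train_a.append(1 if shared or rng.random() < p_private else 0)
        train_b.append(1 if shared or rng.random() < p_private else 0)
    return train_a, train_b


def cross_correlogram(train_a, train_b, max_lag=10):
    """Count coincidences of a spike in A at t with a spike in B at t + lag."""
    n_bins = len(train_a)
    counts = {}
    for lag in range(-max_lag, max_lag + 1):
        total = 0
        for t in range(n_bins):
            u = t + lag
            if 0 <= u < n_bins and train_a[t] and train_b[u]:
                total += 1
        counts[lag] = total
    return counts


train_a, train_b = make_pair()
ccg = cross_correlogram(train_a, train_b)
for lag in sorted(ccg):
    print(f"{lag:+3d} ms  {ccg[lag]}")
```

Computing the same counts within a short window that slides along the trial, rather than over the whole record, is one simple way to follow how the zero-lag peak waxes and wanes over time.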

Theoretical  considerations  point  in  the  same  direction.  Assemblies  de¬ 
fined  by  synchronous  discharges  need  not  oscillate  at  a  constant  frequency 
over  prolonged  periods  of  time.  Rather,  it  is  likely  that  neuronal  networks 
that  have  been  shaped  extensively  by  prior  learning  processes  can  settle 
very  rapidly  into  a  coherent  state  when  the  patterns  of  afferent  sensory 
activity  match  with  the  architecture  of  the  weighted  connections  in  the 
network.  Such  a  good  match  can  be  expected  to  occur  for  familiar  patterns 
that  during  previous  learning  processes  had  the  opportunity  to  mould 
the  architecture  of  connections  and  to  optimize  the  fit.  If  what  matters 
for  the  nervous  system  is  the  simultaneity  of  discharges  in  large  arrays  of 
neurons,  a  single  episode  of  synchronous  discharges  in  thousands  of  dis¬ 
tributed  neurons  may  actually  be  sufficient  for  recognition.  Obviously,  the 
nervous  system  can  evaluate  and  attribute  significance  to  coherent  activity 
even  if  the  synchronous  event  is  only  of  short  duration  and  not  repeated 
because  its  parallel  organization  allows  for  simultaneous  assessment  of
highly  distributed  activity. 

Especially  if  no  further  ambiguities  have  to  be  resolved,  or  if  no  further 
modifications  of  synaptic  connectivity  are  required,  it  would  actually  be 
advantageous  if  the  system  would  not  enter  into  prolonged  cycles  of  rever¬ 
beration  after  having  converged  toward  an  organized  state  of  synchrony. 
Rather,  established  assemblies  should  be  erased  by  active  desynchroniza¬ 
tion  as  soon  as  possible  to  allow  for  the  build-up  of  new  representations. 
Thus,  when  processing  highly  familiar  patterns  or  executing  well-trained 
motor  acts  that  raise  no  combinatorial  problem  the  system  would  function 
nearly  as  fast  as  a  simple  feedforward  network.  Activity  will  be  routed 
selectively  through  the  network  of  tuned  connections  and  a  pattern  of  si¬ 
multaneous  discharges  could  emerge  in  the  corresponding  assembly  of 
distributed  cells  with  latencies  that  are  only  a  little  longer  than  the  sum  of 
the  conduction  and  integration  delays  along  the  path  of  excitation.  Con¬ 
vergence  toward  coherent  states  and  hence  "recognition"  might  actually 
occur  even  faster  than  one  might  predict  from  models  that  assume  that 
retinal  signals  are  relayed  serially  from  one  cortical  area  to  the  next  and 
that  recognition  occurs  only  once  cells  in  areas  at  the  top  of  the  processing 
hierarchy  get  driven. 

In  the  present  model  it  is  assumed  that  the  interconnected  cortical  areas 
are  permanently  active,  collectively  striving  toward  coherent  states.  In  this 
case  the  role  of  retinal  signals  is  not  to  provide  the  energy  for  the  successive 
excitation  of  serially  connected  cells  but  to  select  paths  of  convergence  to¬ 
ward  coherent  states  by  shifting  the  time  of  occurrence  of  discharges,  most 
of  which  would  have  occurred  anyway.  The  differential  and  flexible  rout¬ 
ing  of  activity  that  is  required  to  organize  the  appropriate  assemblies  could 
thus  be  achieved  in  parallel  within  and  among  the  different  processing  ar¬ 
eas  and  within  only  a  few  reentrant  cycles.  Because  the  network  is  assumed 
to  be  engaged  in  an  active  exchange  of  signals  even  if  "at  rest"  and  because 
during  the  organization  phase  interactions  only  need  to  influence  firing 
probability,  excitatory  postsynaptic  potentials  can  effectively  contribute  to 
the  organization  without  having  to  summate  until  they  evoke  strong  dis¬ 
charges.  Thus,  not  much  time  has  to  be  reserved  for  temporal  summation 
and  hence  the  duration  of  reentrant  cycles  can  be  as  short  as  the  sum  of  the 
net  conduction  and  synaptic  transmission  times.  This  possibility  of  rapid 
convergence  toward  coherent  states  and  the  option  to  maintain  such  states 
only  for  short  durations  is  fully  compatible  with  the  hypothesis  that  repre¬ 
sentations  consist  of  large  assemblies  of  coherently  active  neurons.  But  it 
may  become  very  difficult  to  experimentally  identify  the  coherent  states  of 
assemblies  if  these  last  only  for  very  short  periods  of  time  and  if  only  a  few 
neurons  can  be  recorded  simultaneously.  Thus,  as  long  as  experimenters 
can  assess  the  activity  of  only  a  few  neurons  at  a  time,  coherence  will  be 
detectable  only  if  it  is  maintained  over  sufficiently  long  periods.  Such  is 
likely  to  be  the  case  when  ambiguities  have  to  be  resolved  and  when  novel 
patterns  have  to  be  learned. 
But  why  then  do  episodes  of  prolonged  coherence  occur  in  anesthetized
preparations  and  in  response  to  rather  simple  stimulus  configurations?  In
this  case  there  are  no  ambiguities  to  be  resolved  and  there  is  no  learning. 
At  present  one  can  only  speculate.  One  possibility  is  that  anesthesia  de¬ 
creases  the  efficiency  of  feedback  loops  and  associative  connections  (see, 
e.g.,  Cauller  and  Kulics  1991b)  and  that  this  reduces  the  complexity  of  the 
system.  In  the  absence  of  feedback,  neurons  at  peripheral  processing  lev¬ 
els  can  organize  their  responses  only  according  to  the  criteria  set  by  local 
connections,  and  this  is  likely  to  result  in  rather  stereotyped  repetition  of
"attempts"  to  organize.  Moreover,  it  is  likely  that  anesthesia  also  abolishes
the  processes  that  would  normally  terminate  states  of  synchrony  once  the 
system  has  successfully  converged  toward  a  coherent  state  and  "recogni¬ 
tion"  has  occurred  (see  below).  This  interpretation  agrees  with  the  con¬ 
sistent  observation  that  episodes  of  response  synchronization  are  shorter 
and  less  stereotyped  in  awake  behaving  animals  (Kreiter  and  Singer  1992; 
Gray  and  Viana  di  Prisco  1993). 

EYE  MOVEMENTS  AND  SELECTIVE  ATTENTION 

So  far,  only  grouping  operations  that  do  not  require  scanning  eye  move¬ 
ments  or  shifts  of  selective  attention  have  been  considered.  However,  un¬ 
der  normal  viewing  conditions  both  processes  certainly  contribute.  While 
familiar  scenes,  even  if  they  are  complex,  are  usually  perceived  readily 
within  a  few  hundred  milliseconds,  recognition  of  unfamiliar  objects  or 
analysis  of  scenic  details  does  require  more  time.  Often  it  is  even  necessary 
to  successively  sample  parts  of  the  pattern  by  scanning  eye  movements. 
This  requires  that  representations  of  components  of  the  pattern  are  main¬ 
tained  (remembered)  during  successive  eye  movements  to  allow  for  the 
synthesis  of  successively  perceived  components.  But  it  also  requires  that 
patterns  at  the  more  peripheral  levels  of  analysis  change  from  eye  move¬ 
ment  to  eye  movement.  Thus,  the  possibility  needs  to  be  considered  that 
the  temporal  scales  at  which  activity  patterns  become  organized  differ  at 
different  levels  of  integration.  At  the  levels  directly  involved  in  the  seg¬ 
mentation  of  scenes  and  the  appropriate  grouping  of  features,  organized 
states  should  last  only  briefly  and  they  should  definitely  not  outlast  the 
duration  of  the  retinal  image.  Moreover,  they  should  be  reset  with  each 
eye  movement  to  reduce  confusion  between  successive,  often  unrelated 
images.  Both  postulates  agree  with  psychophysical  evidence. 

Patterns  that  are  too  complex  to  be  remembered  at  a  semantic  level  or 
that  defy  semantic  description  remain  represented  only  little  more  than 
100  msec  after  their  offset.  If  interrupted  for  more  than  150  msec,  the  ap¬ 
proximate  duration  of  visual  persistence,  modifications  of  the  patterns  go 
undetected  when  introduced  during  the  interruption  interval  (Phillips  and 
Singer  1974;  Di  Lollo  and  Wilson  1978).  Likewise,  saccadic  eye  movements 
also  disrupt  the  ability  to  detect  changes  if  these  are  introduced  while  the 
eyes  move.  Thus,  disrupting  the  continuous  flow  of  retinal  activity  by 
transiently  obscuring  the  pattern  or  by  making  a  saccadic  eye  movement
appears  to  erase  completely  the  organized  state  induced  by  the  pattern 
that  was  present  prior  to  the  interruption.  In  case  of  eye  movements  it 
actually  appears  as  if  this  resetting  is  an  active  process.  The  fact  that  the 
pontogeniculooccipital  waves  (PGO  waves)  that  accompany  saccadic  eye 
movements  are  prominent  in  the  lateral  geniculate  nucleus  and  in  visual 
areas  of  the  occipital  lobe  but  not  in  more  frontally  located  areas  is  com¬ 
patible  with  such  a  view. 

It  has  been  suggested  that  the  PGO  waves  or  eye  movement  potentials 
reflect  corollary  activity  that  is  generated  in  the  brain  stem  in  association 
with  saccadic  eye  movements  and  serve  to  erase  or  reset  activation  patterns 
in  peripheral  visual  centers  each  time  an  eye  movement  is  executed  (for 
review  see  Singer  1977, 1979).  In  the  framework  of  the  present  model  this 
resetting  would  have  to  consist  of  disrupting  assemblies  that  have  become 
organized  in  response  to  the  pattern  processed  prior  to  the  saccade.  Hence 
this  resetting  should  act  by  decorrelating  previously  synchronized  activity. 
Evidence  is  indeed  available  that  the  activity  that  underlies  PGO  waves 
does  have  a  desynchronizing  effect  both  at  the  thalamic  and  the  cortical 
level  (for  review  see  Singer  1977,  1979;  Steriade  and  McCarley  1990).  In 
addition  it  raises  cortical  excitability  (for  review  see  Singer  1979;  Steriade 
1991),  which  should  in  turn  favor  rapid  self-organization  of  new  assemblies 
in  response  to  the  pattern  that  is  going  to  be  processed  after  the  saccade  has 
occurred.  In  higher  areas,  by  contrast,  which  are  located  more  frontally 
and  perform  much  more  abstracted,  semantic  descriptions  of  patterns, 
such  automatic  eye  movement-related  resetting  must  not  occur.  Here,  the 
time  frames  for  the  organization  of  assemblies  should  be  set  in  a  flexible 
way  and  adjusted  according  to  the  actual  time  it  takes  to  organize  coherent 
assemblies.  If  oscillatory  activity  is  instrumental  to  establish  coherence 
among  distributed  cells  and  to  organize  assemblies,  one  might  then  also 
expect  that  the  frequency  of  these  oscillations  could  be  much  slower  in 
these  higher  areas  and  perhaps  even  varied  as  a  function  of  the  actually 
required  integration  times. 

SYNCHRONIZATION  AND  ATTENTION 

The  hypothesis  that  information  about  feature  constellations  is  contained 
in  the  temporal  relation  between  the  discharges  of  distributed  neurons, 
and,  in  particular,  in  their  synchrony,  has  also  some  bearing  on  the  organi¬ 
zation  of  attentional  mechanisms.  It  is  obvious  that  synchronous  activity 
will  be  more  effective  in  driving  cells  at  higher  levels  than  nonorganized 
asynchronous  discharges. 
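
Why synchronous input is more effective can be made concrete with a leaky integrator. The sketch below is a deliberately simplified target cell, not a model from this chapter; the time constant, synaptic weight, and threshold are assumed values. The cell receives the same 50 input spikes in both cases, either all in one millisecond or spread evenly over 50 msec.

```python
import math


def count_output_spikes(input_times_ms, weight=1.0, tau_ms=5.0,
                        threshold=20.0, duration_ms=100):
    """Leaky integrator with threshold and reset, stepped in 1-ms bins.

    Parameter values are illustrative assumptions, not measured quantities.
    """
    decay = math.exp(-1.0 / tau_ms)  # fraction of potential kept per ms
    arrivals = [0] * duration_ms
    for t in input_times_ms:
        arrivals[t] += 1
    potential = 0.0
    n_output = 0
    for t in range(duration_ms):
        potential = potential * decay + weight * arrivals[t]
        if potential >= threshold:
            n_output += 1
            potential = 0.0  # reset after an output spike
    return n_output


synchronous = [10] * 50            # 50 inputs arriving in the same bin
asynchronous = list(range(10, 60))  # the same 50 inputs, one per ms

print(count_output_spikes(synchronous))   # 1
print(count_output_spikes(asynchronous))  # 0
```

With these assumed parameters the dispersed input never lifts the potential above about 5.5, far below threshold, whereas the synchronous volley crosses it at once. The total input is identical; only its temporal structure differs.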

Thus,  those  assemblies  that  succeed  in  making  their  discharges  coherent
with  shorter  latency  and  higher  temporal  precision  than  others  would  appear
particularly  salient  and  hence  effective  in  attracting  attention.
Conversely,  responses  of  neurons  reacting  to  features  that  cannot  be  grouped  or
bound  successfully,  and,  hence,  cannot  be  synchronized  with  the  responses 
of  other  neurons,  would  have  only  a  small  chance  of  being  relayed  further
and  of  influencing  shifts  of  selective  attention.  It  is  thus  conceivable  that
of  the  many  responses  that  occur  at  peripheral  stages  of  visual  process¬ 
ing  only  a  few  are  actually  passed  on  toward  higher  levels.  These  would 
either  be  responses  to  particularly  salient  stimuli  causing  strong  and  si¬ 
multaneous  discharges  in  a  sufficient  number  of  neurons  or  responses  of 
cells  that  succeeded  in  being  organized  in  sufficiently  coherent  assemblies. 
Thus,  responses  to  changes  in  stimulus  configuration  or  to  moving  targets 
have  a  good  chance  to  be  passed  on  even  without  getting  organized  in¬ 
ternally  because  they  would  be  synchronized  by  the  external  event.  But 
responses  to  stationary  patterns  will  require  organization  through  internal 
synchronization  mechanisms  to  be  propagated. 

This  interpretation  implies  that  neuronal  responses  that  attract  atten¬ 
tion  and  gain  control  over  behavior  should  differ  from  nonattended  re¬ 
sponses  not  so  much  because  they  are  stronger  but  because  they  are  bet¬ 
ter  synchronized  among  one  another.  A  neuronal  network  model  using 
synchronization  rather  than  rate  modulation  of  discharges  as  a  code  for 
saliency  in  attentional  processes  has  recently  been  realized  by  Niebur  et 
al.  (1993).  Following  the  same  reasoning,  shifting  attention  by  top-down 
processes  would  be  equivalent  to  biasing  the  synchronization  probability  of
neurons  at  lower  levels  by  feedback  connections  from  higher  levels.  These 
top-down  influences  could  favor  the  emergence  of  coherent  states  in  se¬ 
lected  subpopulations  of  neurons — the  neurons  that  respond  to  contours 
of  an  "attended"  object  or  pattern.  Thus,  the  mechanism  that  allows  for 
grouping  and  scene  segmentation — the  organization  of  synchrony — could 
also  serve  the  management  of  attention.  The  advantage  would  be  that 
nonattended  signals  do  not  have  to  be  suppressed,  which  would  thereby
eliminate  them  from  competition  for  attention.  Rather,  cells  could  remain 
active  and  thus  be  rapidly  recruitable  into  an  assembly  if  changes  of  af¬ 
ferent  activity  or  of  feedback  signals  modify  the  balance  among  neurons 
competing  for  the  formation  of  synchronous  assemblies. 

In  a  similar  way  shifts  of  attention  across  different  modalities  could  be 
achieved  by  enhancing  selectively  synchronization  probability  in  particu¬ 
lar  sensory  areas  and  not  in  others.  This  could  be  achieved,  for  example, 
by  modulatory  input  from  the  basal  forebrain  or  nonspecific  thalamic  nu¬ 
clei.  If  these  projection  systems  were  able  to  modulate  in  synchrony  the 
excitability  of  cortical  neurons  distributed  in  different  areas  this  would 
greatly  enhance  the  probability  that  these  neurons  link  selectively  with 
each  other  and  join  into  coherent  activity.  Such  linking  would  be
equivalent  to  the  binding  of  the  features  represented  in  the  respective  cortical
areas.  Again,  this  view  equates  grouping  or  binding  mechanisms  with 
attentional  mechanisms.  The  "attention"  directing  systems  would  simply 
have  to  provide  a  temporal  frame  within  which  distributed  responses  can 
then  self-organize  toward  coherent  states  through  the  network  of  selective 
corticocortical  connections.  In  doing  so  the  attentional  systems  need  not 
themselves  produce  responses  in  cortical  neurons.  It  would  be  sufficient 
that they cause a synchronous modulation of their excitability. It is conceivable
that the synchronous field potential oscillations that have been
observed  in  animals  and  humans  during  states  of  focused  attention  are 
the  reflection  of  such  an  attention  mechanism  (for  review  of  the  extensive 
literature  see  Singer  1993).  The  observations  that  these  field  potential  os¬ 
cillations  are  only  loosely  related  to  the  discharge  probability  of  individual 
neurons,  are  coherent  across  different  cortical  areas,  are  particularly  pro¬ 
nounced  when  the  subjects  are  busy  with  tasks  requiring  integration  of  ac¬ 
tivity  across  different  cortical  areas  and  stop  immediately  when  the  binding 
problem  is  solved — as  witnessed  by  the  execution  of  a  well-programmed 
motor  act — are  in  agreement  with  such  an  interpretation. 

SUMMARY 

In  this  section  a  scenario  of  cortical  processes  is  developed  in  which  re¬ 
sponse  synchronization  is  used  for  scene  segmentation,  perceptual  group¬ 
ing,  and  the  organization  of  sensory  representations.  The  essential  ingre¬ 
dients  of  this  model  are  depicted  schematically  in  figure  10.4.  The  different 
boxes  stand  for  some  of  the  numerous  cortical  areas  devoted  to  the  process¬ 
ing  of  retinal  signals.  The  arrows  between  them  symbolize  the  possibility 
of  a  reciprocal  flow  of  signals  between  areas  at  similar  and  different  levels 
of  the  processing  hierarchy.  For  a  detailed  description  of  the  connectivity 
pattern  between  different  visual  areas  the  reader  is  referred  to  Felleman 
and  Van  Essen  (1991)  and  Young  (1992).  On  presentation  of  a  complex 
visual scene the following sequence of events is assumed to occur. Neurons
in V1 that encounter a preferred feature in their receptive field start
responding. At the very same time these responses become organized due
to the action of the tangential connections within V1. Because of the specific
architecture of these connections, neurons coactivated by continuous
contours or nearby contours with similar orientation, or neurons activated
by collinear contour segments, will tend to synchronize their activity. While
this organization proceeds in V1, signals are passed on to other areas where
similar organization processes are initiated. Of the many responses in V1
those  that  became  synchronized  best  will  be  particularly  effective  in  influ¬ 
encing  neurons  in  higher  areas.  Therefore,  response  constellations  that  fit 
the  grouping  criteria  set  by  the  architecture  of  tangential  connections  in 
V1 will be passed on and processed further with higher probability than
incoherent responses that also arrive from V1. Because the connections
from V1 to the other areas convey already preprocessed activity and by
divergence and convergence allow for remapping of neighborhood relations,
the grouping criteria in these higher areas should differ from those
in V1. Thus, it is assumed that in V5 those neurons have a tendency to
synchronize  their  responses  which  code  for  the  same  direction  of  motion. 
Because  the  neurons  in  V5  have  large  receptive  fields,  and  hence  a  great 
aperture,  and  are  also  sensitive  to  relative  motion,  this  area  can  evaluate 
coherent  motion  both  in  relative  and  absolute  terms  over  large  distances. 
While the responses in V5 become organized according to the grouping
criteria set by the intrinsic interactions within V5 it is assumed that they
influence via the backprojections the organization process in V1, adding
the criterion of coherent motion to the grouping process in V1. This top-down
influence is thought to bias synchronization probability between
neurons in V1 either toward more or toward less synchrony, depending
on  stimulus  configuration.  Responses  to  contour  elements  that  are  far 
apart  and  have  different  orientations  have  a  low  probability  of  becoming 
synchronized by local interactions within V1. However, if these contour
elements move coherently their coherence would be detected by neurons
in V5; responses to these contours would synchronize in V5 and through
the backprojections increase synchronization probability for the respective
set of neurons in V1. Such top-down influences from motion-sensitive areas
with large aperture could account for the observation that coherently
moving  line  segments  lead  to  synchronization  of  responses  in  area  17  even 
if  the  cortical  representations  of  these  line  segments  are  much  further  apart 
than  the  maximal  span  of  the  tangential  intracortical  connections  (see,  for 
example, Gray et al. 1989). The finding that pharmacological inactivation
of cells in motion-sensitive areas reduces considerably response synchronization
to coherently moving contours in V1 supports this possibility
(Nelson et al. 1992b). Conversely, responses to nearby contours of similar
orientation that would have a tendency to become synchronized due
to the local interactions in V1 may be prevented from synchronizing by
top-down  influences  from  motion-sensitive  areas  if  the  contours  move  in 
different  directions  and  with  different  speed.  Such  differences  in  motion 
trajectories  have  been  shown  to  prevent  neurons  in  motion-sensitive  areas 
from  synchronizing  (Kreiter  and  Singer  1992),  and  hence,  activity  in  the 
backprojections would either not favor the occurrence of synchrony in V1
or even actively reduce its probability. Similar grouping operations are
assumed to occur simultaneously in numerous other prestriate areas but
according  to  different  criteria. 

Thus,  while  one  area  explores  similarities  in  color  space,  another  may 
search  for  related  textures,  and  yet  another  for  similarities  in  retinal  dis¬ 
parity  etc.  The  results  of  these  evaluations,  which  can  all  occur  in  parallel, 
are sent back to V1 where they all contribute to the ongoing organization
process.  As  a  consequence,  the  synchronization  probabilities  among  neu¬ 
rons  in  area  17  change  and  this  in  turn  modifies  the  input  configurations 
to  prestriate  areas.  While  the  distributed  search  for  the  most  probable 
grouping  constellations  proceeds,  areas  at  the  top  of  the  processing  hier¬ 
archy  will  also  become  involved.  Because  of  the  polysynaptic  nature  of 
the  input  to  these  areas  they  will  probably  become  active  only  once  ac¬ 
tivity  in  the  preceding  areas  has  become  sufficiently  coherent,  but  then 
responses  should  be  organized  in  the  higher  areas  according  to  the  same 
general  rules  as  at  peripheral  levels  given  the  similarity  of  the  intrinsic 
organization  of  the  different  cortical  areas.  The  grouping  criteria  will  be 
much  more  complex  at  these  higher  levels,  however,  because  interactions 
now involve neurons that represent complicated constellations of features
such  as  figural  components  and  higher  order  geometric  shapes  (Tanaka 
et  al.  1991;  Gallant  et  al.  1993).  Because  these  higher  areas  are  connected 
to  lower  areas  via  massive  backprojections,  it  must  be  assumed  that  once 
coherent  patterns  became  organized  at  higher  levels  these  influence  in  turn 
the  organization  of  patterns  at  lower  levels.  These  processes  can  all  occur 
nearly  simultaneously  as  the  areas  concerned  are  all  interconnected  either 
directly  or  via  oligosynaptic  pathways.  Thus,  the  process  of  organizing 
the  neuronal  representation  of  a  scene  consists  of  parallel  operations  that 
occur  nearly  simultaneously  at  different  levels  of  the  processing  hierarchy 
and  according  to  similar  rules.  But  because  of  differences  in  the  way  in 
which ascending activity from V1 is mapped into different areas, the evaluation
criteria differ for each area and increase in complexity as one moves
away from V1. In this model decisions required for successful perceptual
grouping  and  scene  segmentation  are  thus  based  on  a  highly  distributed 
voting  operation  where  each  of  the  different  areas  contributes  its  "point  of 
view"  and  where  both  bottom-up  and  top-down  processes  are  intimately 
interleaved. 

Each  of  the  areas  explores  the  feature  space  for  which  it  is  predisposed  by 
its  specific  afferent  and  intrinsic  connectivity,  searches  for  coherence,  and 
distributes  the  result  of  its  computation  simultaneously  to  all  the  areas  to 
which  it  is  connected.  These  messages  are  assumed  to  bias  the  probabilities 
with  which  neurons  in  the  respective  target  areas  are  going  to  synchronize 
or  desynchronize  their  discharges. 

Successful  segmentation  could  thus  be  viewed  as  the  result  of  a  self¬ 
organizing  process  that  converges  toward  the  state  of  maximal  probability. 
If  scenes  contain  little  ambiguity  with  respect  to  the  grouping  criteria  that 
are  stored  in  the  architecture  of  connections  within  and  between  areas,  the 
organization  process  can  be  very  rapid  and  in  extreme  cases  it  may  not  even 
require  the  contribution  of  backprojected  activity.  This  could  even  be  true 
for  complex  scenes  if  they  contain  mainly  familiar  objects.  In  this  case  the 
pattern  of  sensory  activity  would  match  directly  with  the  functional  archi¬ 
tecture  of  coupling  connections  that  has  been  shaped  by  previous  learning 
and  the  system  can  converge  nearly  instantaneously  into  a  coherent  state. 
Under  such  conditions  the  system  would  function  in  a  way  that  is  not  too 
different  from  a  multilayered  feedforward  network.  However,  if  the  scene 
contains  ambiguities  allowing  for  several  equally  likely  groupings  or  if 
it  is  highly  unfamiliar,  convergence  may  occur  only  after  seconds.  Such 
extreme  processing  times  may  actually  be  required  for  the  segmentation 
of  figures  defined  solely  by  similar  disparity  in  random  dot  patterns  or 
for  the  detection  of  figures  hidden  in  background  textures  by  camouflage 
as  for  example  the  well-known  Dalmatian  dog.  In  both  cases  it  is  helpful 
and  reduces  recognition  time  if  one  already  knows  what  the  figure  is,  a 
pragmatic  proof  of  the  notion  that  high  level  representations  can  directly 
influence  figure-ground  segmentation  via  top-down  biasing  of  peripheral 
grouping  criteria.  In  case  of  the  Dalmatian  dog,  for  example,  recognition 
could be sped up either by top-down propagation if previous experience
has  already  installed  grouping  criteria  at  the  levels  where  figural  attributes 
are  bound  together  or  if  one  provided  additional  cues  that  would  facilitate 
grouping by bottom-up processes at peripheral levels. If the contour elements
constituting the dog had any of the properties in common which V1
and  prestriate  areas  can  probably  evaluate  and  relate  to  one  another,  such 
as  disparity,  color,  motion,  orientation,  and  texture,  segmentation  would 
occur  much  faster. 

In  this  scenario  a  pattern  is  perceived  as  soon  as  segmentation  is  com¬ 
pleted  and  neurons  have  become  organized  in  distinct,  coherently  active 
assemblies.  In  that  case,  their  output  will  be  sufficiently  coherent  to  allow 
for  the  propagation  of  signals  to  remote  cortical  areas  and  ultimately  to 
effector  levels.  For  this  to  occur  it  is  necessary  not  only  that  enough  cells 
coordinate  their  responses,  but  also  that  the  spatial  distribution  of  these 
coherently  active  cells  matches  the  "receptive  field"  properties  of  cells  at 
higher levels. Just as cells in V1 are selective for particular spatiotemporal
patterns of retinal input, cells in higher cortical areas are likely to be activatable
only by the appropriate spatiotemporal patterns that have organized
in  more  peripheral  cortical  areas.  But  in  contrast  to  the  retinal  and  thala¬ 
mic  activation  patterns,  these  cortical  activation  patterns  are  no  longer  a 
direct  reflection  of  the  retinal  image  but  a  result  of  a  highly  dynamic  self¬ 
organizing  process.  The  organization  of  the  spatial  and  temporal  structure 
of  these  patterns  is  initiated  by  the  retinal  input,  but  then  it  is  extensively 
modified  by  dynamic  interactions  that  are  determined  essentially  by  the 
functional  architecture  of  connections  linking  cells  within  and  between 
areas.  The  proposal  is  that  this  organization  process  converges  toward 
coherent  states  in  which  responses  that  need  to  be  related  to  one  another 
are  tagged  by  their  synchrony. 

Following  the  same  line  of  reasoning  it  is  also  possible  that  access  to  the 
level  of  processing  where  representations  reach  consciousness  is  gated  by 
coherence.  As  proposed  by  Crick  and  Koch  (1990a)  it  is  conceivable  that 
only  those  activation  patterns  (assemblies)  reach  the  threshold  of  conscious 
awareness  that  are  sufficiently  organized,  that  is  coherent. 
What Form Should a Cortical Theory Take?

Charles  F.  Stevens 


Although  neurobiology  has  accumulated  an  impressive  body  of  informa¬ 
tion  about  neocortical  structure  and  operation,  the  nature  of  the  mathe¬ 
matical  computations  performed  by  cortex  remains  a  mystery.  A  descrip¬ 
tion  of  these  computations  is  equivalent  to  developing  a  theory  of  cortical 
function,  and  the  construction  of  such  a  theory  is  necessarily  one  of  neu¬ 
robiology's  central  problems.  Although  this  chapter  deals  with  cortical 
theory,  its  goal  is  much  less  ambitious  than  proposing  what  cortex  com¬ 
putes.  Rather  it  attempts  one  of  the  first  steps  in  that  direction:  to  outline 
what  form  a  cortical  theory  should  take. 

Why  try  to  define  the  form  for  such  a  theory  when  we  surely  are  a  long 
way  from  being  able  to  develop  an  adequate  theoretical  structure?  A  strong 
form  of  the  argument  for  this  approach  is  as  follows:  A  particular  cortical 
region — primary  visual  cortex,  for  instance — performs  some  computations 
on  its  inputs  to  determine  what  information  is  sent  to  other  areas.  The  types 
of  computations  one  tends  to  think  of  are  Fourier  or  Gabor  transforms,  cal¬ 
culation  of  cross-correlation  functions,  or  deconvolutions,  but  the  actual 
computations  may  not  be  ones  that  are  currently  familiar.  As  a  prerequisite 
for  understanding  the  role  of  a  particular  region  in  the  overall  cortical  pro¬ 
cessing  of  information,  then,  we  must  identify  the  computations  carried  out 
by  that  region.  And  before  these  computations  can  be  recognized,  we  must 
decide  what  sort  of  mathematical  machinery  is  to  be  used  for  their  charac¬ 
terization.  For  example,  should  we  make  a  probabilistic  description,  or  is 
a  deterministic  one  adequate?  Identifying  the  nature  of  the  theory  we  seek 
is  essential,  because  this  determines,  to  a  great  extent,  what  sort  of  experi¬ 
ments  are  needed  and  what  further  theoretical  approaches  should  be  tried. 

FOUR  REQUIREMENTS  FOR  A  THEORY  OF  CORTEX 

The  first  step  in  defining  the  nature  of  a  cortical  theory  is  to  identify  some 
of  the  theory's  requirements.  Here  four  requirements  for  a  theory  of  cortex 
are  proposed  and  discussed. 

Before  a  complete  cortical  theory  could  be  developed,  one  must  know: 
How  many  different  types  of  cortex  are  there?  That  is,  how  many  theories 
are required? Clearly, very many functionally distinct cortical regions can
be recognized, over 30 in the visual system alone (Felleman and Van Essen,
1991). But functionally distinct areas may not be computationally unique.
For example, the functional differences between areas (e.g., V1 vs. MT)
might  reside  more  in  the  nature  and  distribution  of  cortical  inputs  and  on 
the  disposition  of  cortical  outputs  than  in  the  operations  performed  by  the 
cortical  circuits  on  the  information  they  receive.  Specifically,  functionally 
distinct cortical regions, like V1 and MT, might perform identical mathematical
operations on different sorts of inputs. This general notion has been
proposed repeatedly (Lorente de Nó 1949; Creutzfeldt 1977; Powell 1981;
Eccles  1984),  and  is  supported  by  a  variety  of  developmental,  anatomical, 
and  physiological  observations. 

If  "type  of  cortex"  refers  to  the  mathematical  computation  performed, 
rather than to some functional difference between areas revealed, say,
by  different  receptive  field  structures,  one  might  believe  that  only  a  single 
major  computational  type  of  cortex  exists;  this  can  be  called  the  "unitary 
theory."  Alternatively,  activity-dependent  rewiring  of  cortical  circuits  (see, 
for  example,  Shatz  1990)  could  modify  the  computations  performed  by 
even  initially  uniform  cortices,  so  that  the  character  of  some  mathemati¬ 
cal  operation  might  vary  continuously  across  even  an  apparently  uniform 
cortical  region  like  primary  visual  cortex;  this  other  limiting  case  could  be 
termed  the  "continuous  diversity  theory."  Most  hopeful  for  neurobiology 
is  the  unitary  theory:  if  this  were  true,  understanding  the  basic  computa¬ 
tions  carried  out  by  any  cortical  area  would  then  provide  an  answer  for 
all  of  cortex.  Unraveling  cortical  function  in  this  limit  would  simply  (!) 
amount  to  learning  how  inputs  and  outputs  are  mapped. 

The  first  requirement  for  the  framework  of  a  cortical  theory,  then,  is  that 
it  must  be  able  to  accommodate  the  spectrum  of  possibilities,  from  the 
unitary  to  the  continuous  diversity  views. 

How  many  inputs  and  outputs  are  present  in  cortex?  The  answer  de¬ 
pends  on  how  many  classes  of  neurons  are  present.  Currently  available  in¬ 
formation  (see,  for  example,  Purves  et  al.  1992)  indicates  that  the  number  of 
input  and  output  types  should  be  greater  than  one,  but  not  a  large  number, 
perhaps  10  to  100.  The  cortical  inputs  and  outputs  must  thus  be  described 
by  vectors  whose  dimensions  are  to  be  determined  experimentally.  Note 
that  inputs  include  information  sent  to  one  cortical  region  from  another 
one  and  that  outputs  are  defined  in  any  convenient  way.  At  this  stage,  a 
cortical  theory  must  be  able  to  accommodate  an  arbitrary  number  of  inputs 
and  outputs.  This  constitutes  an  additional  part  of  the  first  requirement. 

The  second  requirement  is  that  the  theory  be  an  explicitly  probabilistic  one 
because  synaptic  transmission  is  a  stochastic  process  (Katz  1969):  Neu¬ 
rotransmitter  is  released  at  axon  terminals  in  packets — called  quanta — so 
that  the  total  effect  of  a  nerve  impulse  arrival  is  an  integral  multiple  of 
the  smallest  effect,  the  one  produced  by  a  single  quantum.  The  quanta, 
however,  are  released  probabilistically  according  to  a  Poisson  process  (see 
Barrett and Stevens 1972) with a Poisson rate λ(t) that depends on time t.
Normally, λ(t) is very small, but just after a nerve impulse arrives at the
synapse, the release rate increases transiently. The net effect is that the
size  of  the  postsynaptic  response  due  to  the  arrival  of  a  nerve  impulse  at 
an  axon  terminal  varies  at  random.  The  specific  need  for  a  probabilistic 
theory  in  brain  arises  from  the  following  considerations. 

The  high  signal-to-noise  ratio  of  whole  cell  recording  and  the  use  of 
methods  that  cause  localized  release  of  neurotransmitter  have  permitted 
the  characteristics  of  individual  quanta  to  be  determined  for  central  neu¬ 
rons  (Bekkers  and  Stevens  1989;  Edwards  et  al.  1990;  Bekkers  et  al.  1990; 
Manabe  et  al.  1992;  Raastad  et  al.  1992;  Silver  et  al.  1992).  Further,  this 
knowledge  of  quantal  size,  and  its  variation,  has  made  possible  a  rigorous 
quantal  analysis  of  central  synapses  (Bekkers  and  Stevens  1989, 1990).  The 
conclusion  of  these  investigations  is  that  the  probability  of  release  at  an 
individual  synapse  is  generally  very  low,  about  0.1  to  0.5. 
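
The quantal picture described above can be illustrated with a minimal sketch. It assumes that the number of quanta released after one impulse is Poisson distributed with a mean equal to the release rate integrated over the brief window following the impulse. The function name, the mean of 0.5 quanta, and the 10 pA quantal size are hypothetical values chosen for illustration, not figures from the studies cited.

```python
import math

def quantal_response_distribution(mean_quanta, quantal_size, max_quanta=10):
    """Return (amplitude, probability) pairs for 0..max_quanta released quanta.

    mean_quanta  -- assumed integral of the release rate over the window
                    following one impulse (hypothetical parameter)
    quantal_size -- assumed response produced by a single quantum
    """
    dist = []
    for k in range(max_quanta + 1):
        # Poisson probability of releasing exactly k quanta
        prob = math.exp(-mean_quanta) * mean_quanta ** k / math.factorial(k)
        # The response is an integral multiple of the quantal size
        dist.append((k * quantal_size, prob))
    return dist

# Hypothetical example: mean of 0.5 quanta per impulse, 10 pA per quantum.
for amplitude, prob in quantal_response_distribution(0.5, 10.0, max_quanta=4):
    print(f"{amplitude:5.1f} pA  p = {prob:.3f}")
```

Under these assumptions the first entry is the probability of a transmission failure, exp(-mean_quanta), so a low mean yields failures on a large share of impulses.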

Sometimes  the  number  of  synapses  one  neuron  makes  with  another 
can  be  determined:  in  cortex,  any  particular  neuron  generally  seems  to 
receive  only  one  or  two  synapses  from  any  other  neuron.  In  hippocampus, 
for  example,  Andersen  (1990)  estimated  that  a  given  axon  usually  makes 
only  a  single  synapse  (average  estimated  to  be  1.3)  on  its  target  cell.  The 
lateral  geniculate  axons  that  project  to  visual  cortex  also  make  only  one  or 
a  few  (up  to  about  eight)  synapses  on  their  targets  (Freund  et  al.  1985). 
Taken  together,  then,  these  observations  indicate  that  when  a  pair  of  cells 
is  connected,  the  communication  link  between  them  is  quite  unreliable  for 
a  single  impulse  arrival,  although  it  is  predictable  in  a  statistical  sense. 
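
To make the unreliability of such a link concrete, the following sketch computes the chance that a single impulse produces any release at all when a connection consists of a few synapses. It assumes that the synapses release independently and share one release probability, which is a simplification not stated in the text; the function name is hypothetical.

```python
def transmission_probability(release_prob, n_synapses):
    """Probability that at least one of n_synapses releases transmitter.

    Assumes independent synapses with identical release probability
    (an illustrative simplification).
    """
    return 1.0 - (1.0 - release_prob) ** n_synapses

# Release probabilities spanning the range quoted in the text,
# for connections made of one or two synapses.
for p in (0.1, 0.5):
    for n in (1, 2):
        print(f"p = {p}, synapses = {n}: "
              f"P(response) = {transmission_probability(p, n):.2f}")
```

Even with two synapses and a release probability at the top of the quoted range, a quarter of impulses would produce no response under these assumptions.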

Experimental  confirmation  of  this  conclusion  is  available  for  hippocam¬ 
pus  and  primary  visual  cortex.  When  the  intensity  of  a  stimulus  applied  to 
axons  that  project  onto  CA1  neurons  is  reduced  to  low  levels — intensities 
perhaps  adequate  for  stimulating  just  a  single  axon  —  only  a  small  fraction 
of stimuli (about 0.1 to 0.5) produces postsynaptic currents, that is, about
five to nine out of ten of the nerve impulses generated by a cell produce no postsynaptic
response at a given target neuron. These "minimal" stimuli may
actually  stimulate  more  than  one  axon,  and  each  axon  may  make  more  than 
one  synapse  with  its  target  cell.  In  any  event,  two  conclusions  are  secure 
for  hippocampus:  (1)  the  release  probability  of  about  0.5  estimated  in  this 
way  is  an  upper  limit  for  the  actual  release  probability  and  (2)  the  effect  of 
one  neuron  on  another  is  generally  small  and  uncertain.  This  conclusion  is 
supported  in  a  general  way  by  the  correlational  analysis  by  Tanaka  (1983), 
which shows that even when geniculate and V1 receptive fields overlap,
the  extent  to  which  a  geniculate  discharge  predicts  a  cortical  neuronal  dis¬ 
charge  is  only  about  0.1.  Note  that  although  the  Tanaka  result  is  difficult 
to  connect  directly  to  our  argument  because  of  the  complexity  of  his  ex¬ 
perimental  situation  and  the  indirect  nature  of  his  measures  of  cell-to-cell 
connections,  his  observations  do  support  the  notion  that  a  single  input  has 
only  a  relatively  small  effect  on  its  target.  Because  transmission  at  a  sin¬ 
gle  synapse  is  so  unreliable,  and  because  each  neuron  makes  only  a  few 
synapses  with  its  target  cell,  neuron  to  neuron  communication  must  also 
be  quite  unreliable:  the  random  nature  of  synaptic  transmission  makes 
neuronal behavior uncertain and thus networks of such neurons must be
described  probabilistically.  These  statements  apply,  of  course,  to  many, 
but  probably  not  all,  cortical  neurons.  Examples  are  known  (Purkinje  cells 
in  the  cerebellum,  for  example)  in  which  one  neuron  makes  thousands  of 
synapses  on  its  target  cell,  and  statistical  fluctuations  in  synaptic  strength 
are  very  small. 

Now  we  begin  the  background  discussion  for  the  third  requirement, 
which  is,  we  shall  argue,  that  the  theory  should  be  continuous  and  should 
treat  the  cortex  in  a  coarse-grained  way.  One  might  think  that,  ideally, 
a  cortical  theory  should  start  from  details  of  neuronal  properties  and  the 
principles  that  determine  connections  of  cortical  circuitry,  and  then  derive 
from  these  a  description  of  the  computations  performed  by  each  neuron 
and  thus  by  the  network  as  a  whole.  But  such  a  cell-by-cell  description 
is probably neither feasible nor desirable. A cubic millimeter of cortex (a
good candidate for the size of a computational unit) contains 10^5 neurons,
10^9 synapses, and two miles of axons; each neuron receives about 10^4
synapses and communicates with about 10^4 other neurons (see White 1989;
Stevens  1989;  Braitenberg  and  Schuz  1991).  Most  of  these  connections  are 
intracortical  (Peters  1987).  Furthermore,  the  average  effect  one  cortical 
neuron has on another is quite small, ranging from less than 50 μV to
several millivolts (Thompson et al. 1988; Mason et al. 1991). Because single
neurons  have  small  and  uncertain  effects  on  other  neurons,  the  cortical 
description  must  be  carried  out  in  terms  of  neuronal  populations  rather 
than  at  the  level  of  individual  cells. 

A  consideration  of  cortical  anatomy  points  to  the  nature  of  the  neuronal 
populations  that  form  the  natural  basis  for  a  cortical  theory.  The  argument 
will  be  made  in  terms  of  cat  primary  visual  cortex  layer  4,  but  the  same 
conclusions  are  reached  for  any  cortical  region.  Layer  4  neurons  have 
a  dendritic  tree  with  a  diameter  of  about  0.3  mm  (Martin  1984).  Pick  a 
particular  neuron  as  a  reference  and  ask:  how  many  layer  4  cells  have 
dendritic  trees  that  overlap  that  of  the  reference  cell  and  thus  potentially 
have  access  to  the  reference  cell's  synaptic  input?  Layer  4  is  about  0.3 
mm thick and cortex has a density of about 10^5 neurons per mm^3 (Powell
1981). All of the neurons in layer 4 that fall within a cylinder with a radius
of about 0.3 mm will have overlapping dendritic trees. The number of
neurons that overlap with the reference cell is thus π(0.3)^2(0.3) mm^3 times
10^5 cells/mm^3, or approximately 8000 neurons; within this population,
a  number  of  distinct  neuronal  types  might  be  found.  Each  type  could  have 
different  input  patterns,  but  still  a  significant  fraction  of  the  population 
should  represent  essentially  the  same  information. 
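
The estimate above is a cylinder volume multiplied by a cell density. A short sketch of the arithmetic, using only the figures quoted in the text (the function name is hypothetical):

```python
import math

def neurons_in_cylinder(radius_mm, height_mm, density_per_mm3):
    """Number of neurons inside a cylinder of cortex at a given density."""
    volume_mm3 = math.pi * radius_mm ** 2 * height_mm
    return volume_mm3 * density_per_mm3

# Radius 0.3 mm, layer 4 thickness 0.3 mm, 10^5 neurons per mm^3.
count = neurons_in_cylinder(0.3, 0.3, 1e5)
print(f"about {count:.0f} neurons")
```

The result is a little under 8500, which the text rounds to approximately 8000.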

In addition to the fact that overlapping dendritic trees tend to define
equivalence  classes  of  neurons  (here,  an  equivalence  class  would  be  all  of 
the  neurons  that  would  potentially  have  anatomical  access  to  a  particular 
set  of  axon  terminals),  a  given  axon  generally  arborizes  over  a  considerable 
region  of  cortex  with  an  arbor  diameter  of  perhaps  0.5  mm  (Martin  1984), 
and  forms  about  2000  boutons,  each  of  which  makes  one  or  two  synapses 
(Freund et al. 1989). Thus, neurons of the same functional class and in
the  same  cortical  layer  share  nearly  the  same  potential  synaptic  inputs 
whenever  their  cell  bodies  are  separated  by  several  hundred  microns  or 
less,  and  the  degree  of  similarity  in  their  inputs  increases  as  the  distance 
between  cell  bodies  decreases.  In  cat  primary  visual  cortex,  about  one 
third  of  the  neurons  whose  receptive  field  overlaps  with  that  of  a  particular 
geniculate  neuron  receive  input  from  the  geniculate  cell  (Tanaka  1983). 

Altogether,  these  observations — together  with  the  stochastic  nature  of 
neuronal  behavior— suggest  that  the  physiologically  meaningful  signal 
from  cortex  should  be  the  average  firing  rates  of  a  population  of  perhaps 
100  to  1000  neurons  near  a  particular  cortical  site.  The  behavior  of  cortex 
at  a  particular  point  would  then  be  described  by  the  firing  in  a  population 
of  neurons.  The  total  firing  that  represents  this  population  would  be  de¬ 
termined  by  a  weighted  average  of  the  appropriate  neurons  in  the  cortical 
region  that  surrounded  the  point,  perhaps  with  weights  that  are  described 
by  a  spatial  Gaussian.  As  one  moved  from  one  cortical  location  to  an  ad¬ 
jacent  one,  the  neuronal  population  whose  firing  defined  the  state  of  the 
new  cortical  point  would  overlap  with  the  previous  population  so  that  the 
variables  describing  cortical  state  would  vary  continuously  with  cortical 
position.  Features  of  cortical  structure  such  as  ocular  dominance  columns 
are treated with a straightforward extension of these notions in which the
cortex  is  viewed  as  interleaved  continuous  regions. 
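
The Gaussian-weighted population average described above can be sketched as follows. This is only an illustration on a one-dimensional strip of cortex: the function name, the 0.1 mm Gaussian width, the neuron spacing, and the firing rates are all assumed values, not measurements.

```python
import math

def population_rate(positions_mm, rates_hz, x_mm, sigma_mm=0.1):
    """Gaussian-weighted average firing rate around cortical position x_mm.

    sigma_mm is a hypothetical width for the spatial Gaussian.
    """
    weights = [math.exp(-((p - x_mm) ** 2) / (2.0 * sigma_mm ** 2))
               for p in positions_mm]
    return sum(w * r for w, r in zip(weights, rates_hz)) / sum(weights)

# Hypothetical strip: one neuron every 0.01 mm, with an abrupt change
# in single-neuron firing rate at 0.5 mm.
positions = [i * 0.01 for i in range(101)]
rates = [10.0 if p < 0.5 else 30.0 for p in positions]

# Neighboring sites average over overlapping populations, so the coarse-
# grained variable changes smoothly even though single-cell rates jump.
for x in (0.40, 0.45, 0.50, 0.55, 0.60):
    print(f"x = {x:.2f} mm: {population_rate(positions, rates, x):.1f} Hz")
```

Because adjacent sites share most of their neurons, the population variable is a smooth function of position, which is what allows cortical state to be treated with continuous mathematics.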

The  third  requirement  for  a  theory  of  cortex,  then,  is  that  it  must  be  coarse¬ 
grained  and  treat  cortical  inputs  and  outputs  as  continuous  variables  that 
represent  the  summed  behavior  of  appropriately  sized  and  selected  neuron 
populations. 

Although  individual  neurons  behave  probabilistically,  if  the  population 
of  cells  needed  in  this  coarse-grained  description  were  sufficiently  large, 
a  deterministic  description  would  suffice.  Indeed,  deterministic  theories 
probably  will  be  adequate  for  many  purposes,  like  the  Hartline-Ratliff 
equation  described  below.  But  in  certain  situations — the  treatment  of 
activity-dependent  modification  of  neuronal  circuit  connections  discussed 
later,  for  example — the  stochastic  nature  of  brain  operation  will  have  to  be 
treated  explicitly.  Furthermore,  some  of  the  essential  calculations  made  by 
cortex,  like  the  computation  of  cross-correlation  functions,  may  well  turn 
out  to  require  a  probabilistic  description. 
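
The claim that a large enough population permits a deterministic description can be illustrated with a simple sketch. It assumes that each neuron fires independently with the same probability in a short time window, which is an idealization (neurons sharing inputs will be correlated); the function names and the firing probability of 0.2 are hypothetical.

```python
import math
import random

def std_of_population_mean(p_fire, n_neurons):
    """Standard deviation of the fraction of neurons firing in one window,
    assuming independent neurons with identical firing probability."""
    return math.sqrt(p_fire * (1.0 - p_fire) / n_neurons)

def simulate_population_mean(p_fire, n_neurons, rng):
    """Draw one sample of the fraction of neurons that fire."""
    spikes = sum(1 for _ in range(n_neurons) if rng.random() < p_fire)
    return spikes / n_neurons

rng = random.Random(0)  # fixed seed so the sketch is repeatable
for n in (10, 100, 1000):
    samples = [simulate_population_mean(0.2, n, rng) for _ in range(200)]
    spread = max(samples) - min(samples)
    print(f"N = {n:4d}: predicted std = {std_of_population_mean(0.2, n):.4f}, "
          f"observed range = {spread:.3f}")
```

Under these assumptions the fluctuations shrink as one over the square root of the population size, so whether a deterministic treatment is adequate depends on how large the relevant population is and how strongly its members are correlated.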

Finally, the prominent recurrent nature of lateral intracortical
connections and relatively wide spatial distribution of cortical inputs mean that
the  cortical  output  at  any  one  location  must  depend  on  both  the  input  and 
output  over  relatively  great  expanses  of  cortex  (for  example,  Gilbert  and 
Wiesel  1979).  That  is,  the  output  at  any  one  point  must  be  a  functional 
of  both  inputs  and  outputs  (for  a  brief  description  of  functionals  and  the 
relevant  literature,  see  Stevens  1987).  This  is  the  fourth  requirement. 

What  are  the  chances  that  these  are  the  right  four  requirements  for  a  start 
on  a  theory  for  cortex?  Not  great,  of  course,  but  little  explicit  attention  has 
been given to the types of theories we should attempt, and this issue seems
to be an important one: the requirements selected should be examined and
debated,  alternatives  explored,  and  questions  raised  should  be  addressed 
by  experiments.  If,  for  example,  theories  that  use  continuous  mathematics 
are  unsuitable,  that  would  be  important  information. 

SIMPLEST  VERSION  OF  THE  GENERAL  APPROACH 


According to the requirements outlined above, the goal of a cortical theory
is to develop an equation, using continuous mathematics, for the
probability functional of cortical outputs given the inputs. For simplicity in this
initial description, the cortex will be considered to be one-dimensional,
with cortical position specified by the variable x; the temporal coordinate
will be suppressed. Further, only a single input s(x) and a single output
f(x) will be used for this introductory treatment. The input and output at
position x represent the firing rates of input and output neurons averaged
over a small cortical volume. What is needed to describe this simple cortex,
then, is the functional P[f(x); s(x)] that specifies the probability of finding
the output f(x) (note that this function represents the entire output of the
cortex) given that the input of the cortex is described by the function s(x).
This sort of formulation meets the four requirements: The macroscopic
nature of the theory comes in the coarse-grained treatment used to define
input and output, the inputs and outputs are treated as continuous
functions, the requirement for a probabilistic formulation is explicit, and the
effects of lateral connections reside in the fact that P is a functional.

A starting point for a useful description of cortex is the identity

$$P[f(x); s(x)] = e^{-S[f; s]}$$

where $S[f; s] = -\ln P[f(x); s(x)]$, by definition. The motivation for this
definition is as follows: Under some circumstances, for example, an input
s(x) that varies only very slightly across the cortex, a functional power
series expansion in the arguments f(x) and s(x) should approach an accurate
representation of cortical function. By transforming P in this way, we
define a slowly varying functional that is more amenable to a power series
approach; this is the case we want in the limit of inputs that are sufficiently
close to a constant.

The  functional  S  contains  a  complete  description  of  the  cortical  operation. 
Cortical  processing,  however,  occurs  at  specific  locations,  so  S  needs  to  be 
recast  in  a  form  that  makes  the  local  nature  of  the  description  explicit. 
Expand S in a functional power series (Volterra 1959)

$$S[f; s] = A_0[s] + \int dx\, A_1[s; x]\, f(x) + \iint dx\, dx'\, A_2[s; x, x']\, f(x) f(x') + \cdots$$

where the $A_k[s; x, x', \ldots]$ are functionals of s(x) and functions of x. Now
rearrange the terms so that

$$S[f; s] = \int L[f(x); s(x), x]\, dx$$

where L is defined as

$$L[f; s, x] = f(x) \left\{ A_1[s; x] + \int dx'\, A_2[s; x, x']\, f(x') + \cdots \right\}$$

Because L provides a local specification of cortical function, it will be called
the cortical characterization functional. Note that the $A_0$ functional has been
excluded because this term would appear in the normalization of the
probability functional. Now the probability functional that describes the cortex
is just

$$P[f(x); s(x)] \sim e^{-\int L[f; s, x]\, dx}$$

and  the  job  of  a  cortical  theory  is  to  identify  the  cortical  characterization 
functional  L.  So  far,  of  course,  the  only  physical  content  has  been  the  four 
initial  requirements  in  addition  to  the  notion  of  a  formulation  in  terms  of 
local  cortical  properties.  The  extent  to  which  this  sort  of  formulation  is 
useful  depends  on  developing  ways  to  determine  L. 

THE  SIMPLEST  CORTICAL  CHARACTERIZATION  FUNCTIONAL 


An  Approximate  Cortical  Characterization  Functional 


The  preceding  formulation  was  designed  for  a  power  series  expansion 
approach.  The  simplest  way  to  get  closer  to  L,  then,  is  to  expand  it  in  a 
functional  power  series  and  discard  the  higher  order  terms.  Because  the 
functionals to be expanded depend on both f(x) and s(x), it is easier to
keep  things  straight  if  S[f;  s]  itself,  rather  than  L,  is  expanded;  L  can  then 
be  recognized  in  the  resulting  expressions.  Expand  S  to  second  order  in 
both  of  its  arguments: 


$$
\begin{aligned}
S[f; s] = A_0 &+ \int dx\, A_1(x) f(x) + \int dx\, A_2(x) s(x) \\
&+ \tfrac{1}{2} \iint dx\, dx'\, K(x, x') f(x) f(x') \\
&+ \tfrac{1}{2} \iint dx\, dx'\, M(x, x') f(x) s(x') \\
&+ \tfrac{1}{2} \iint dx\, dx'\, A_3(x, x') s(x) s(x')
\end{aligned}
$$

Here the $A_k$, K, and M are functions that arise in the Volterra expansion.
Each of the $A_k$ terms vanishes: they contribute to the probability of an
output f(x) without depending on that function, so they are included in
the normalization of the probability. The final expression, to second order,
for the probability functional P is thus

$$P[f(x); s(x)] \sim \exp\left\{ -\tfrac{1}{2} \iint dx\, dx'\, f(x) \left[ f(x') K(x - x') + s(x') M(x - x') \right] \right\}$$


Note the additional assumption that the cortical circuitry is spatially
uniform across the particular cortical region being treated so that the integrals
are convolutions. The cortical characterization functional L can be
identified in this approximation as

$$L[f; s, x] = \frac{f(x)}{2} \int dx' \left[ f(x') K(x - x') + s(x') M(x - x') \right]$$

The equation for P defines the probability of any output for a given
input; K and M are functions that arise in the Volterra expansion. In this
approximation, describing cortical operation involves discovering the form
of two functions, K and M.
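As an illustration of how the second-order probability functional ranks candidate outputs, the following sketch discretizes S on a grid and compares three outputs for the same input. The Gaussian kernels are assumptions made for this example only (they are not taken from the text); M is set to -2K so that, under the stationarity condition of this approximation, the output f = s is the most probable one.

```python
import math

def gauss(d, width):
    return math.exp(-0.5 * (d / width) ** 2)

def action(f, s, K, M, dx):
    """Discretized S[f; s] = 1/2 * sum over x, x' of f(x) [f(x') K(x-x') + s(x') M(x-x')]."""
    n = len(f)
    total = 0.0
    for i in range(n):
        for j in range(n):
            d = (i - j) * dx
            total += f[i] * (f[j] * K(d) + s[j] * M(d))
    return 0.5 * total * dx * dx

dx = 0.1
xs = [i * dx for i in range(40)]
s = [math.sin(math.pi * x / 4.0) for x in xs]      # hypothetical smooth input

K = lambda d: gauss(d, 0.3)                         # assumed output-output kernel
M = lambda d: -2.0 * gauss(d, 0.3)                  # assumed input-output kernel (M = -2K)

S_zero = action([0.0] * len(s), s, K, M, dx)        # silent cortex
S_match = action(s, s, K, M, dx)                    # output equal to the input
S_double = action([2.0 * v for v in s], s, K, M, dx)

# Unnormalized probability ratio of the matching output relative to silence
ratio = math.exp(-(S_match - S_zero))
print(S_zero, S_match, S_double, ratio)
```

Because P is proportional to exp(-S), the output with the smallest S is the most probable; only ratios of probabilities are meaningful here, since the normalization has been dropped.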

Experiments generally measure the average output to a particular
stimulus, so an expression for the average response must be extracted from the
equation  to  provide  a  link  between  theory  and  experiment. 

Because  probabilities  are  specified,  the  equation  for  P  also  describes 
the  output  noise  for  no  (or  constant)  input.  In  this  preliminary  treatment 
the  time  dependence  of  inputs  and  outputs  has  been  suppressed,  so  the 
resting  output  noise  would  be  spatial  variations  in  neuronal  firing  f(x) 
predicted  for  an  unstimulated  cortex.  The  statistics  of  these  resting  output 
fluctuations refer to a hypothetical ensemble of identical cortices. The
average  response  and  resting  fluctuations  are  considered  in  turn. 

Average  Response 

The average response $\bar{f}(x)$ for a given s(x) is found, by definition, from the
functional integral

$$\bar{f}(x) = \int \mathcal{D}f\; e^{-\int dx\, L}\, f(x)$$

Here $\mathcal{D}f$ is a volume element in function space and is related to the Wiener
measure (see the appendix of Stevens 1987 and the references cited there).
Although  this  functional  integration  can  be  carried  out  for  the  simple  L 
produced  by  the  power  series  approach,  an  easier  way  to  the  desired 
result — and  one  that  works  in  a  wider  variety  of  situations — is  to  find 
not  the  average,  but  rather  the  most  probable  response.  In  this  particular 
situation, the average and the most likely responses happen to be
identical (because, as will be seen, the fluctuations are Gaussian). The most
likely response is found by discovering the output f(x) that maximizes the
$e^{-\int dx\, L}$ term, which is equivalent to finding the f(x) that minimizes $\int dx\, L$.
The extremum is found from the Euler-Lagrange equation

$$\frac{\delta}{\delta f(\xi)} \int L[f; s, x]\, dx = 0$$

The functional differentiation gives, for the simple L obtained above,

$$\frac{\delta}{\delta f(\xi)} \int dx\, L = \frac{1}{2} \int dx \left[ 2 f(x) K(\xi - x) + s(x) M(\xi - x) \right] = 0$$

so that

$$2 \int dx\, f(x) K(\xi - x) = -\int dx\, s(x) M(\xi - x)$$

This equation relates the most likely response f(x) to the cortical input
s(x); note that both the functions K and M, which arise in the Volterra
expansion, are needed to determine the response, and also that expansion
of L to second order gives a linear relation between input and average
output.
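The linear relation between input and most likely output can be checked numerically. In the sketch below the integrals become matrix products (the grid spacing is absorbed into the matrices), the stationarity condition 2Kf = -Ms is solved by Gaussian elimination, and the solution is confirmed to have a lower S than a perturbed output. The particular K (a delta function plus a Gaussian) and M (a delta function) are assumptions chosen for illustration.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

n, dx, m = 21, 0.1, 1.0
# Assumed kernels: K = delta + Gaussian, M = -2 m delta (dx absorbed into the matrices)
Kmat = [[(1.0 if i == j else 0.0)
         + 0.5 * dx * math.exp(-0.5 * ((i - j) * dx / 0.2) ** 2)
         for j in range(n)] for i in range(n)]
Mmat = [[-2.0 * m if i == j else 0.0 for j in range(n)] for i in range(n)]
s = [1.0 if 7 <= i <= 13 else 0.0 for i in range(n)]   # hypothetical bar of input

def action(f):
    """Discretized S = 1/2 f.K f + 1/2 f.M s (overall factor dx dropped)."""
    return 0.5 * dot(f, matvec(Kmat, f)) + 0.5 * dot(f, matvec(Mmat, s))

# Stationarity condition: 2 K f = -M s
f_star = solve([[2.0 * k for k in row] for row in Kmat],
               [-v for v in matvec(Mmat, s)])
residual = max(abs(2.0 * kf + ms)
               for kf, ms in zip(matvec(Kmat, f_star), matvec(Mmat, s)))
print(residual, action(f_star), action([v + 0.1 for v in f_star]))
```

Since this K is positive definite, the stationary point is a minimum of S and therefore the most probable output.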

Since the most likely response f is described by linear equations, the
appropriate characterization of a cortex in this limit is the Green function,
defined to be H(x); this function would, of course, define the receptive
field structure of sensory cortical neurons. The Green function satisfies the
equation [because s(x) would be taken to be a delta function]

$$\int dx\, H(x) K(\xi - x) = -M(\xi)$$

so that (take the Fourier transform)

$$H(x) = -\mathcal{F}^{-1}\left\{ \tilde{M} / \tilde{K} \right\}$$

where $\mathcal{F}^{-1}\{\,\}$ denotes the inverse Fourier transform and the tilde indicates
the Fourier transformed function.
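The Fourier-transform step can be tried on a periodic grid, where the convolution theorem holds exactly for the discrete transform. The sketch below takes the Green-function equation in the form written above (the convolution of H with K equals -M), divides in the Fourier domain, and confirms the result by direct circular convolution. The kernels K and M are hypothetical choices.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive discrete Fourier transform (adequate for small grids)."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = sum(x[m] * cmath.exp(sign * 2j * math.pi * m * k / n) for m in range(n))
        out.append(acc / n if inverse else acc)
    return out

def circ_conv(a, b):
    """Circular convolution: sum over m of a[m] * b[i - m]."""
    n = len(a)
    return [sum(a[m] * b[(i - m) % n] for m in range(n)) for i in range(n)]

N = 32
ring = [min(i, N - i) for i in range(N)]             # distance on a periodic grid
# Assumed kernels: K = delta + broad Gaussian, M = minus a narrow Gaussian
K = [(1.0 if d == 0 else 0.0) + 0.2 * math.exp(-0.5 * (d / 2.0) ** 2) for d in ring]
M = [-math.exp(-0.5 * d ** 2) for d in ring]

K_hat, M_hat = dft(K), dft(M)
H = [z.real for z in dft([-mh / kh for mh, kh in zip(M_hat, K_hat)], inverse=True)]

check = circ_conv(H, K)                               # should reproduce -M
max_err = max(abs(c + mv) for c, mv in zip(check, M))
print(max_err, H[:4])
```

The division is safe here only because the transform of this K is nonzero at every frequency; a kernel whose transform vanishes somewhere would have no such inverse.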

Fluctuations 

In addition to the average response, our probabilistic formalism also
describes the cortex's random output fluctuations in the absence of an input.
If s(x) = 0, the basic equation reduces to

$$P[f(x); 0] = \exp\left\{ -\tfrac{1}{2} \iint dx\, dx'\, f(x) f(x') K(x - x') \right\}$$

This equation describes a Gaussian random process (Feynman and Hibbs
1965) with a covariance function C(x) that is the functional inverse of K;
that is,

$$\int dx'\, K(x') C(x - x') = \delta(x)$$

Specifically, the covariance function is given by (take Fourier transforms of
the preceding equation)

$$C(x) = \mathcal{F}^{-1}\left\{ \tilde{K}^{-1} \right\}$$

Thus,  the  spatial  fluctuations  in  the  output  should  be  Gaussian,  and  the 
statistical  structure  of  these  fluctuations  is  specified  by  the  inverse  of  the 
function  K. 

Because the Green function and the covariance both involve K, the
structure of the spontaneous fluctuations and the driven response are related
by a sort of fluctuation-dissipation theorem. Specifically, if K is eliminated
between the equations for the Green function and the covariance, the result
is

$$H(x) = -\int d\xi\, M(\xi) C(x - \xi)$$




This equation relates the average evoked response (through the Green
function H) to the spontaneous output fluctuations about the mean (through
the  covariance  function  for  the  fluctuations  C). 
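A numerical check of this relation is sketched below on a periodic grid. It adopts the sign convention of the Green-function equation, in which the convolution of H with K equals -M, so that eliminating K gives H as minus the convolution of M with C. The kernels are hypothetical, and the helper names are introduced only for this example.

```python
import cmath
import math

def dft(x, inverse=False):
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = sum(x[m] * cmath.exp(sign * 2j * math.pi * m * k / n) for m in range(n))
        out.append(acc / n if inverse else acc)
    return out

def circ_conv(a, b):
    n = len(a)
    return [sum(a[m] * b[(i - m) % n] for m in range(n)) for i in range(n)]

N = 32
ring = [min(i, N - i) for i in range(N)]
K = [(1.0 if d == 0 else 0.0) + 0.2 * math.exp(-0.5 * (d / 2.0) ** 2) for d in ring]
M = [-math.exp(-0.5 * d ** 2) for d in ring]          # assumed input kernel

K_hat, M_hat = dft(K), dft(M)

# Covariance of the spontaneous fluctuations: functional inverse of K
C = [z.real for z in dft([1.0 / kh for kh in K_hat], inverse=True)]
identity = circ_conv(K, C)                             # should be a discrete delta

# Green function computed directly, and from the covariance
H_direct = [z.real for z in dft([-mh / kh for mh, kh in zip(M_hat, K_hat)], inverse=True)]
H_from_C = [-v for v in circ_conv(M, C)]

delta_err = max(abs(v - (1.0 if i == 0 else 0.0)) for i, v in enumerate(identity))
fd_err = max(abs(a - b) for a, b in zip(H_direct, H_from_C))
print(delta_err, fd_err)
```

In an experiment the logic would run the other way: C would be estimated from spontaneous activity and used, together with M, to predict the receptive field H.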

A  Specific  "Cortex" 

How might the functions K and M be determined? These functions are, of
course, dependent on the precise nature of the neural circuits that are being
described and information about them must ultimately come from
observations on cortical structure and function. The formulation developed here
should apply to any essentially cortex-like network with a well-defined
input and output. In particular, it should apply to the (one-dimensional)
Limulus eye, a neuronal system whose descriptive equation is already
known (Ratliff 1965). For the Limulus eye, the function M that describes
the distribution of the input (light) should be a delta function, so that the
term $-\int dx\, s(x) M(\xi - x) = m s(\xi)$, for some constant m. An ommatidium at
position $\xi$ in the eye is excited according to the input at that location [$m s(\xi)$]
and is subject to lateral inhibition by the surrounding cells. The magnitude
of the inhibition at location $\xi$ depends on the response of the inhibiting cell
at x and on its distance $(x - \xi)$ from the neuron being inhibited according
to the function $G(x - \xi)$. This means that the function K is given by
$$K(x - \xi) = \delta(x - \xi) + G(x - \xi)$$

with  G  specifying  the  lateral  inhibitory  interactions,  so  that  the  equation 
describing  the  eye's  behavior  would  be  the  Hartline-Ratliff  equation 

$$f(\xi) = m s(\xi) - \int dx\, f(x) G(x - \xi)$$

Finally,  anatomical  constraints  would  make  G  a  Gaussian:  the  distribution 
of  inhibition  in  Limulus  eye  is  rotationally  symmetric  and  the  formation  of 
inhibitory  connections  in  the  x  and  y  directions  should  be  independent;  the 
functional  equation  that  results  from  these  constraints  has  a  Gauss  function 
as  its  solution.  In  summary,  the  structure  of  the  Limulus  eye,  together  with 
the  existence  of  rotationally  symmetric  lateral  inhibition,  serves  to  define 
the  functions  K  and  M,  and  the  formalism  then  yields  the  Hartline-Ratliff 
equation. 
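The behavior of the linear Hartline-Ratliff equation is easy to explore numerically. The following sketch solves a discretized version by fixed-point iteration for a step of illumination; the kernel amplitude and width, the absence of self-inhibition, and the input are all assumptions made for illustration, and the thresholds of the full Limulus model are omitted. The iteration converges because the total inhibitory weight received by each unit is less than one.

```python
import math

n, m = 60, 1.0
s = [1.0 if i < 30 else 2.0 for i in range(n)]        # hypothetical luminance step

def G(d):
    """Assumed Gaussian inhibitory kernel with no self-inhibition."""
    return 0.0 if d == 0 else 0.08 * math.exp(-0.5 * (d / 3.0) ** 2)

Gmat = [[G(i - j) for j in range(n)] for i in range(n)]

def inhibited(f):
    """One application of f -> m s - G f."""
    return [m * s[i] - sum(g * v for g, v in zip(Gmat[i], f)) for i in range(n)]

f = [m * v for v in s]
for _ in range(300):                                   # fixed-point iteration
    f = inhibited(f)

residual = max(abs(a - b) for a, b in zip(f, inhibited(f)))
print(residual)
print("dark interior", f[15], "dark edge", f[29])
print("bright edge", f[30], "bright interior", f[45])
```

The output overshoots on the bright side of the step and undershoots on the dark side, the edge enhancement expected from lateral inhibition.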

For the special case of $M(x) = m\delta(x)$ (this is the situation for the
Hartline-Ratliff equation), the Green function is just proportional to the covariance;
this  is  the  usual  fluctuation-dissipation  theorem. 

Fluctuations  in  Development 

The relation between the Green function and the covariance, a natural
consequence of the inherently probabilistic nature of synaptic
transmission, could be of importance for brain development
(Mastronarde 1983). Activity-dependent modification of neural circuits has
been proposed to be critical for the shaping of the final pattern of
connections that determines what computation a circuit performs (Stent 1973;
Changeux 1976; see Shatz 1990). The existence of these fluctuations at one
level, with a correlation function that is related to the receptive field
structure, would thus provide the appropriate activity during development
in utero (when patterned input to cortex related to external stimulation
should be minimal) for the self-organization of circuits at the next level.

EXTENSIONS  OF  THE  SIMPLEST  CASE 

In  the  preceding  sections,  the  cortex  treated  was  one-dimensional,  the  time 
dependence  of  its  operations  was  suppressed,  cortical  computations  were 
supposed  to  be  linear  (functional  expansion  of  L  to  second  order),  and 
only  a  single  input  and  output  cell  type  was  permitted.  The  goal  now  is  to 
remove  these  restrictions. 

Including  Additional  Coordinates  (Space  and  Time) 

The extension to include two spatial dimensions and time is immediate;
the cortical characterization functional now depends on two spatial
dimensions (specified by the vector x) and on time (t):

$$P[f(\mathbf{x}, t); s(\mathbf{x}, t)] = e^{-\int d^2x \int dt\, L[f; s, \mathbf{x}, t]}$$

so that L becomes a functional of f(x, t) and s(x, t) and a function of x and t.
When L is approximated by carrying out a Volterra expansion and
neglecting terms higher than second order, the probability functional becomes

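$$P[f(\mathbf{x}, t); s(\mathbf{x}, t)] = \exp\left\{ -\tfrac{1}{2} \int d^2x \int dt \int d^2x' \int dt'\, f(\mathbf{x}, t) \left[ f(\mathbf{x}', t') K(\mathbf{x} - \mathbf{x}', t - t') + s(\mathbf{x}', t') M(\mathbf{x} - \mathbf{x}', t - t') \right] \right\}$$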

The  specialization  of  this  equation  to  the  Limulus  eye  gives 

$$f(\mathbf{x}, t) = m s(\mathbf{x}, t) - \int d^2x' \int dt'\, G(\mathbf{x} - \mathbf{x}', t - t') f(\mathbf{x}', t')$$

with  the  inhibitory  influence  function  G  known  from  experiment  to  be 

G(x,o  =  forflxVM 

where k, a, and b are constants.

A  Nonlinear  Cortex 

Up to this point the Volterra expansion has been carried out to second
order in both f and s. Expansion to this order in the output f(x) means the
fluctuations are described by a Gaussian process, whereas the s(x)
expansion relates mean output to input by a linear operator. One might expect
fluctuations still to be Gaussian even for a nonlinear cortex. To treat this
case, the Volterra expansion must be carried out to higher order in s (here
third order is used as an example, although any order is possible) but
second order in f to maintain Gaussian fluctuations around the mean. If
fluctuations happened to be non-Gaussian, the expansion could be carried
out to higher order terms in f, but this would entail severe mathematical
difficulties.

To simplify the equations, the cortex will again be one-dimensional and
the following convention will be adopted: what was represented earlier as
integrals, for example,

$$\iint dx\, dx'\, K(x, x') f(x) f(x')$$

will now be expressed in operator notation: $Kf^2$. A third-order term
involving an operator R would be

$$Rfs^2 = \iiint dx\, dy\, dz\, R(x, y, z) f(x) s(y) s(z);$$

often the integrals will be convolutions, but they need not necessarily be.
With this notation, a Volterra expansion of S[f; s] to second order in f and
third order in s is

$$S[f; s] = \tfrac{1}{2} Kf^2 - Mfs + Afs^2 - \tfrac{1}{2} Bf^2 s$$

where  A  and  B  are  operators  that  arise  in  the  functional  expansion,  and 
terms  that  vanish  into  the  normalization  have  not  been  included.  Carry 
out  the  integration  over  one  of  the  variables  in  the  last  two  terms  and  define 
two  new  operators: 

$$B' = Bs$$

$$A' = As$$

Note that A' and B' are functionals of s. With this notation, S can be written

$$S[f; s] = \tfrac{1}{2} (K - B') f^2 - (M - A') fs$$

Again,

$$P[f; s] = e^{-\frac{1}{2} (K - B') f^2 + (M - A') fs}$$

represents a Gaussian process with a covariance function

$$C = (K - B')^{-1}$$

The  inverse  here  is  in  the  functional  sense,  and  the  covariance  now  is  a 
functional  of  the  input  s,  even  if  the  input  is  uniform  across  the  cortex. 
Thus,  for  a  nonlinear  cortex,  the  spatial  correlations  would  change  with 
the input's magnitude and pattern. As before, the most likely response f
to a given input s can be found through the Euler-Lagrange equations and
is

$$f = [(M - A') C]\, s$$

The impulse response, therefore, is just

$$H = (M - A') C$$

This is a generalized fluctuation-dissipation relation, but now the impulse
response (called the Green function in the earlier discussion of the linear
cortex) is more complicated because A' and C are functionals of the input s.
This would mean that the receptive field structure of a nonlinear sensory
cortex could vary according to how it was measured (delta function input
vs. noise, for example).
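The input dependence of the covariance and of the impulse response can be seen in a deliberately tiny example: a "cortex" consisting of two points, so that the operators K and M become 2 x 2 matrices and the third-order operators A and B become arrays with three indices. All numerical values below are hypothetical and serve only to show that C and H change when the input changes.

```python
def inv2(A):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def contract(T, s):
    """Integrate a third-order operator over its last variable: T'[i][j] = sum_k T[i][j][k] s[k]."""
    return [[sum(T[i][j][k] * s[k] for k in range(2)) for j in range(2)] for i in range(2)]

K = [[2.0, 0.5], [0.5, 2.0]]
M = [[1.0, 0.0], [0.0, 1.0]]
# Assumed third-order coefficients, nonzero only for i == j
B = [[[0.1 if i == j else 0.0 for k in range(2)] for j in range(2)] for i in range(2)]
A = [[[0.05 if i == j else 0.0 for k in range(2)] for j in range(2)] for i in range(2)]

def covariance(s):
    """C = (K - B')^{-1}, with B' = B s."""
    Bp = contract(B, s)
    return inv2([[K[i][j] - Bp[i][j] for j in range(2)] for i in range(2)])

def impulse_response(s):
    """H = (M - A') C, with A' = A s."""
    Ap = contract(A, s)
    C = covariance(s)
    MA = [[M[i][j] - Ap[i][j] for j in range(2)] for i in range(2)]
    return [[sum(MA[i][k] * C[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

C_rest = covariance([0.0, 0.0])
C_driven = covariance([1.0, 1.0])
H_rest = impulse_response([0.0, 0.0])
H_driven = impulse_response([1.0, 1.0])
print(C_rest[0][0], C_driven[0][0], H_rest[0][0], H_driven[0][0])
```

Even though the driven input is uniform across the two points, both the fluctuation variance and the impulse response differ from their resting values.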

Multiple  Inputs  and  Outputs 


For  simplicity,  the  treatment  so  far  has  considered  only  a  single  input  and 
output.  Here  the  theoretical  framework  for  a  cortical  theory  is  extended  to 
include  multiple  inputs  and  outputs.  For  example,  primary  visual  cortex 
would require at least on- and off-center inputs for color-coded cells, X
and Y cells, and left and right eye cells; the total number of inputs would
thus be between one and two dozen. Because color and parvocellular and
magnocellular pathways (each with on- and off-center varieties) project
separately from primary visual cortex, about the same number of
outputs would be necessary.

Use of the functional Fourier transform simplifies a treatment of multiple
inputs and outputs. An N-dimensional functional Fourier transform of a
probability functional P[f] is defined by

$$\Phi[\mathbf{w}] = \int \mathcal{D}^N f\; P[\mathbf{f}]\, e^{-i \int \mathbf{f} \cdot \mathbf{w}\, dx}$$

where f is an N-vector of output functions and w is a vector of transform
functions. The input and output functions are treated here as if they
depended only on a single spatial variable, but generalization to two spatial
variables and time is immediate. Note that $\Phi$, known as the characteristic
functional, depends on the vector of transform functions w. The
characteristic functional is especially useful because the mean output, covariance of
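the output, etc. can be found from it immediately. For example,

$$\left[ i\, \frac{\delta \Phi}{\delta w_j(\xi)} \right]_{\mathbf{w}=0} = \int \mathcal{D}^N f\; f_j(\xi)\, P[\mathbf{f}]$$

which is, by definition, the mean $\bar{f}_j(\xi)$ of the jth output. Similarly,

$$\left[ -\frac{\delta^2 \Phi}{\delta w_j(\xi)\, \delta w_k(\xi')} \right]_{\mathbf{w}=0} = \int \mathcal{D}^N f\; f_j(\xi) f_k(\xi')\, P[\mathbf{f}]$$

is the covariance function (assuming, for simplicity, a zero mean). Thus,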
moments  of  the  outputs  are  readily  found  if  $  is  known;  fortunately,  the 
characteristic  functional  for  a  Gaussian  process,  the  type  of  process  that 
results  when  the  cortical  characterization  functional  is  expanded  to  second 
order in the outputs, is not difficult to calculate for multiple output and
input functions.
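The way moments are read off from the characteristic functional can be illustrated in finite dimensions, where the functional becomes the ordinary characteristic function of a Gaussian random vector. The sketch below assumes the transform convention with exp(-i f.w), a hypothetical mean vector and covariance matrix, and central finite differences in place of functional derivatives.

```python
import cmath

mu = [1.0, -2.0]                       # hypothetical mean outputs
C = [[2.0, 0.3], [0.3, 1.0]]           # hypothetical covariance matrix

def phi(w):
    """Characteristic function of a Gaussian vector: exp(-i mu.w - 1/2 w.C w)."""
    lin = sum(m * x for m, x in zip(mu, w))
    quad = sum(w[j] * C[j][k] * w[k] for j in range(2) for k in range(2))
    return cmath.exp(-1j * lin - 0.5 * quad)

def first_moment(j, h=1e-5):
    """i * dPhi/dw_j at w = 0, by a central difference."""
    wp, wm = [0.0, 0.0], [0.0, 0.0]
    wp[j], wm[j] = h, -h
    return (1j * (phi(wp) - phi(wm)) / (2.0 * h)).real

def second_moment(j, k, h=1e-4):
    """-d2Phi/(dw_j dw_k) at w = 0, by a central difference."""
    def shifted(a, b):
        w = [0.0, 0.0]
        w[j] += a
        w[k] += b
        return phi(w)
    d2 = (shifted(h, h) - shifted(h, -h) - shifted(-h, h) + shifted(-h, -h)) / (4.0 * h * h)
    return (-d2).real

print(first_moment(0), first_moment(1))
print(second_moment(0, 0), second_moment(0, 1))
```

The second derivative returns the second moment, C[j][k] + mu[j] * mu[k]; it coincides with the covariance only when the mean is zero, which is the simplifying assumption made in the text.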

What  is  required  is  a  generalization  of  the  single  input-output  cortical 
characterization  functional 


$$L[f; s, x] = \frac{f(x)}{2} \int dx' \left[ f(x') K(x - x') + s(x') M(x - x') \right]$$

The functional L can be written in the shorthand operator notation
employed above as

$$L[f; s, x] = \frac{f}{2} \left[ Kf + Ms \right]$$

and the S functional, in this same notation, is

$$S[f; s] = \tfrac{1}{2} \left\{ Kf^2 + Mfs \right\}$$

To generalize to multiple inputs-outputs, consider vectors of output and
input functions f and s, and matrices K and M that contain operators. For
example, $f_j$ would be the output from the jth class of cortical neuron and the
j, kth operator (matrix element) would be interpreted, for a one-dimensional
cortex with multiple inputs and outputs, as

$$f_j K_{jk} f_k = \iint dx\, dx'\, K_{jk}(x - x') f_j(x) f_k(x')$$

The S functional for such a cortex is just

$$S[\mathbf{f}; \mathbf{s}] = \tfrac{1}{2} \left[ \mathbf{f} K \mathbf{f} + \mathbf{s} M \mathbf{f} \right]$$

and the probability functional is

$$P[\mathbf{f}; \mathbf{s}] = e^{-\frac{1}{2} [\mathbf{f} K \mathbf{f} + \mathbf{s} M \mathbf{f}]}$$


The  problem  is  to  identify  the  covariance  functions  and  mean  associated 
with  the  probability  functional. 

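The starting place is the corresponding characteristic functional

$$\Phi[\mathbf{w}] = \int \mathcal{D}^N f\; e^{-\frac{1}{2} [\mathbf{f} K \mathbf{f} + \mathbf{s} M \mathbf{f}] - i \int dx\, \mathbf{f} \cdot \mathbf{w}}$$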


The  idea  is  to  simplify  this  expression  so  that  covariance  and  mean  are 
apparent.  This  is  done  by  completing  squares.  That  is,  the  vector  of  output 
functions  f  is  transformed  to  a  new  vector  in  such  a  way  that  the  result 
contains  only  linear  and  squared  terms  in  the  transform  variable  w  that 
can  be  immediately  identified. 

For a vector of constants a, transform f according to f → f' = f − a. When
this change of variables is made, some terms in the expression for $\Phi[\mathbf{w}]$
contain only f' and the operator matrix K, some do not contain the vector of
functions f', and some (the cross-terms) contain combinations of f' with
s, a, and w. The vector a can be chosen to make these cross-terms vanish, a
condition that makes $\mathbf{a} = -(i\mathbf{w} - \mathbf{s}M)C$, where $C = K^{-1}$. The combinations
of elements in C that arise from K are most easily computed, when cortex
is uniform (so convolutions can be used), with Fourier transforms. For
example, $C_{00}$ might be given (as it is for a cortex with just two outputs)
by $K_{11}/(K_{00}K_{11} - K_{01}^2)$; this would mean that $C_{00}$ is found from the inverse
Fourier transform of the Fourier transformed K entries in the algebraic
expression. The result of eliminating the cross terms by the appropriate
selection of a is

$$\Phi[\mathbf{w}] = \int \mathcal{D}^N f'\; e^{-\frac{1}{2} \mathbf{f}' K \mathbf{f}'}\; e^{-\frac{1}{2} [\mathbf{w} C \mathbf{w} - 2i\, \mathbf{s} M C \mathbf{w} + \mathbf{s} M C M \mathbf{s}]}$$

(Note that $\mathcal{D}^N f' = \mathcal{D}^N f$ because a is a constant with respect to the differential.)
The integral and the $\exp(-\tfrac{1}{2} \mathbf{s} M C M \mathbf{s})$ term vanish in the normalization so
the final expression for the characteristic functional is

$$\Phi[\mathbf{w}] = e^{-\frac{1}{2} [\mathbf{w} C \mathbf{w} - 2i\, \mathbf{s} M C \mathbf{w}]}$$

The mean can be recognized as

$$\bar{\mathbf{f}} = \mathbf{s} M C$$

and the covariance functional is given by C. Thus, the same formalism can
be easily generalized to cortices with multiple inputs and outputs, and a
generalized fluctuation-dissipation relation still holds.
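For a uniform cortex with two outputs, the matrix inverse C = K^{-1} reduces to a 2 x 2 algebraic inversion at each spatial frequency. The following sketch carries this out on a periodic grid and verifies the result by circular convolution. The three kernels are hypothetical, the cross-coupling is taken to be symmetric (K_01 = K_10), and the helper names are introduced only for this example.

```python
import cmath
import math

def dft(x, inverse=False):
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = sum(x[m] * cmath.exp(sign * 2j * math.pi * m * k / n) for m in range(n))
        out.append(acc / n if inverse else acc)
    return out

def inverse_real(x_hat):
    return [z.real for z in dft(x_hat, inverse=True)]

def circ_conv(a, b):
    n = len(a)
    return [sum(a[m] * b[(i - m) % n] for m in range(n)) for i in range(n)]

N = 16
ring = [min(i, N - i) for i in range(N)]
g = [math.exp(-0.5 * (d / 1.5) ** 2) for d in ring]
delta = [1.0 if i == 0 else 0.0 for i in range(N)]

# Assumed operator kernels for a two-output cortex
K00 = [delta[i] + 0.3 * g[i] for i in range(N)]
K11 = [delta[i] + 0.2 * g[i] for i in range(N)]
K01 = [0.1 * g[i] for i in range(N)]

k00, k11, k01 = dft(K00), dft(K11), dft(K01)
det = [a * b - c * c for a, b, c in zip(k00, k11, k01)]

C00 = inverse_real([b / d for b, d in zip(k11, det)])    # K11 / (K00 K11 - K01^2)
C11 = inverse_real([a / d for a, d in zip(k00, det)])
C01 = inverse_real([-c / d for c, d in zip(k01, det)])

# Check the first row of K C = identity
diag = [a + b for a, b in zip(circ_conv(K00, C00), circ_conv(K01, C01))]
off = [a + b for a, b in zip(circ_conv(K00, C01), circ_conv(K01, C11))]
diag_err = max(abs(v - delta[i]) for i, v in enumerate(diag))
off_err = max(abs(v) for v in off)
print(diag_err, off_err)
```

Note that the cross-covariance C01 is nonzero only because the two output classes are coupled through K01.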

HOW  CAN  THE  CORTICAL  CHARACTERIZATION  FUNCTIONAL  BE 
FOUND? 

If the formalism described here is an appropriate one, then the job of a
cortical theorist is to specify the cortical characterization functional. A power
series  approach  gives  the  form  of  the  equations,  but  the  number  of  inputs 
and  outputs,  and  the  nature  of  the  functions  that  appear  as  a  result  of  the 
Volterra  expansion  must  be  determined  by  biological  and  other  constraints. 
In  the  simple  Limulus  eye  example,  anatomical  and  electrophysiological 
investigations  revealed  the  number  of  inputs  and  outputs  (one)  and  the 
existence  of  lateral  inhibitory  connections;  symmetry  conditions  identify 
the  inhibitory  influence  function  as  Gaussian. 

Doubtless  a  cortical  theory  will  involve  the  same  sort  of  anatomical  and 
physiological  information  combined  with  general  constraints.  Analysis 
of  receptive  field  structure  will  partly  specify  the  unknown  functions  (for 
example, Reid et al. 1991), but additional constraints, derived from
principles like minimal redundancy (Atick and Redlich 1992), scale invariance,
and other symmetries, will probably be required as well to fill in gaps left
by incomplete information about cortical structure and function. The
initial attempts may have to restrict the problem in some ways, for example,
by  considering  an  appropriately  chosen  subset  of  the  inputs  and  outputs 
and  by  dealing  with  only  a  single  cortical  layer  or  sublayer  (like  primary 
visual  cortex  layer  4). 

An  alternative  approach  is  to  postulate  the  general  nature  of  the  cortical 
computation.  For  example,  one  might  suppose  that  the  job  of  cortex  is  to 
solve  an  underconstrained  inverse  problem  (Poggio  et  al.  1985).  Consider, 
again  for  simplicity,  a  one-dimensional  cortex  with  time  suppressed,  and 
suppose that the desired output f(x) is the one that minimizes $(Af - Bs)^2$
over the entire cortex for linear operators A and B. This problem might be
ill-posed so that it must be regularized by adding $(Rf)^2$ for another linear
operator R. The computation made by cortex then is to find the output f
that satisfies the equation

$$\frac{\delta}{\delta f} \int dx \left[ (Af - Bs)^2 + (Rf)^2 \right] = 0$$

If the S functional is taken to be

$$S[f; s] = \int dx \left[ (Af - Bs)^2 + (Rf)^2 \right]$$

then the most likely response f(x) will be the one that the cortex is supposed
to compute. This means that the cortical characterization functional L can
be immediately identified for this case as

$$L[f; s, x] = (Af - Bs)^2 + (Rf)^2$$

Examination of this last relation reveals that it has just the same form as
the expression for L developed earlier with the power series approach when
the expansion was carried out to second order in both f and s. Specifically,
if a uniform cortex is assumed so the integral operators are represented by
convolutions, the S functional will be

$$
\begin{aligned}
S[f; s] = {} & \iiint dx\, d\xi\, d\eta\, A(x - \xi) A(x - \eta) f(\xi) f(\eta) \\
& - 2 \iiint dx\, d\xi\, d\eta\, A(x - \xi) B(x - \eta) f(\xi) s(\eta) \\
& + \iiint dx\, d\xi\, d\eta\, R(x - \xi) R(x - \eta) f(\xi) f(\eta) \\
& + \int dx\, (Bs)^2
\end{aligned}
$$

Now, define

$$K(\xi - \eta) = \int dx\, A(x - \xi) A(x - \eta) + \int dx\, R(x - \xi) R(x - \eta)$$

and

$$M(\xi - \eta) = \int dx\, A(x - \xi) B(x - \eta)$$

With these definitions, the S functional is written, up to a functional that is
independent of f and thus vanishes in the normalization of the probability
functional, as

$$S[f; s] = \iint d\xi\, d\eta\, K(\xi - \eta) f(\xi) f(\eta) - 2 \iint d\xi\, d\eta\, M(\xi - \eta) f(\xi) s(\eta)$$

This is, of course, of just the same form as obtained earlier by expansion of
S to second order in f and s. If the inverse problem to be solved involves
just linear operators, it is thus equivalent to the approximate (second-order)
theory developed above. If the operators are nonlinear, however, then the
situation  would  be  more  complex  and  the  relationship  to  the  power  series 
approach  would  have  to  be  established  for  each  specific  case.  Whether  the 
operators  A,  B,  and  R  are  linear  or  nonlinear,  this  approach  permits  the 
cortical  characterization  functional  to  be  identified  immediately. 
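The linear case can be illustrated with matrices standing in for the operators. In the sketch below A is a three-point blur, B is the identity, and R is a scaled first difference; these choices, the grid size, and the input are hypothetical. With eight grid points this blur matrix happens to be singular, so the unregularized problem has no unique solution, whereas K = AᵀA + RᵀR is invertible and the most likely output solves Kf = Ms with M = AᵀB.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

n, lam = 8, 0.1
A = [[1.0 / 3.0 if abs(i - j) <= 1 else 0.0 for j in range(n)] for i in range(n)]
B = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
R = [[lam * (1.0 if j == i + 1 else -1.0 if j == i else 0.0) for j in range(n)]
     for i in range(n - 1)]
s = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0]           # hypothetical input

AtA, RtR = matmul(transpose(A), A), matmul(transpose(R), R)
K = [[AtA[i][j] + RtR[i][j] for j in range(n)] for i in range(n)]
M = matmul(transpose(A), B)

f_star = solve(K, matvec(M, s))                          # stationary point: K f = M s

def objective(f):
    """Discretized S = |A f - B s|^2 + |R f|^2."""
    fit = [a - b for a, b in zip(matvec(A, f), matvec(B, s))]
    reg = matvec(R, f)
    return sum(v * v for v in fit) + sum(v * v for v in reg)

residual = max(abs(a - b) for a, b in zip(matvec(K, f_star), matvec(M, s)))
print(residual, objective(f_star), objective([v + 0.05 for v in f_star]))
```

The strength of the regularizer (here the scale lam) trades fidelity to the data term against smoothness of the output; how a cortex would set it is not specified by the formalism.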

To  reiterate,  the  goal  here  has  not  been  to  formulate  a  theory  of  cortex 
but  rather  to  identify  the  form  that  any  such  theory  should  take.  Insofar  as 
the  arguments  are  correct,  this  initial  step  in  considering  cortical  theories 
has  defined  the  problem  to  be  solved  (specify  the  cortical  characterization 
functional)  and  has  indicated  some  of  the  paths  that  might  be  followed  to 
do  this. 

The real challenge, of course, is to decide what requirements for a cortical
theory are the correct ones and then to use the resulting theoretical
framework to increase our understanding of how the brain works. Whether
this will be possible is by no means obvious. Nevertheless, this effort is
already producing beneficial results: our laboratory is carrying out
experiments designed specifically to answer questions posed in the discussion
of the requirements for the theory.

Sequence Seeking and Counterstreams: A
Model  for  Bidirectional  Information  Flow  in 
the  Cortex 

Shimon  Ullman 


Considering the wide range of functions it performs, the mammalian
neocortex is notably uniform in structure. Although cytoarchitectonic
differences exist between neocortical areas (e.g., the striate cortex in certain
primates, or the giant Betz cells in motor cortex), in terms of laminar
organization, number of cells, cell types, and general connectivity patterns
there are close similarities among different cortical areas in the same
animal, and across species (Rockel et al. 1980; Van Essen 1985; Martin 1988;
White 1989). In the words of Martin (1988), "it would take an expert to
distinguish  rat  frontal  cortex  from  sheep  parietal  cortex,  or  cat  auditory 
cortex  from  monkey  somatosensory  cortex." 

This structural uniformity has suggested the possibility of common
computational principles that may be used, with suitable local variations,
throughout the neocortex (Creutzfeldt 1978; Barlow 1985; Crick and
Asanuma 1986; Sejnowski 1986). Several proposals have been made
regarding the possible general operation of the neocortex (Marr 1970;
Creutzfeldt  1978;  Edelman  1978;  Barlow  1985;  Grossberg  1988;  Mumford 
1991, 1992;  Poggio  1990;  Ullman  1991). 

In this chapter, a model for some general aspects of information flow in
the  neocortex  is  proposed.  The  proposed  computation  is  quite  general  in 
nature,  but  the  focus  of  the  discussion  will  be  on  vision  and  the  visual 
cortex.  The  first  part  of  the  chapter  outlines  the  general  computation  pro¬ 
posed  by  the  model,  and  the  second  its  biological  implementation.  The 
model  is  used  to  account  for  known  features  of  cortical  circuitry,  and  to 
derive  a  number  of  new  predictions. 

SEQUENCE  SEEKING  AND  COUNTERSTREAMS 

A  general  task  frequently  faced  by  the  brain  is  one  of  establishing  a  link 
between  two  different  representations.  For  example,  in  visual  recognition, 
the  task  involves  establishing  a  connection  between  an  incoming  pattern 
and  stored  object  representations  in  visual  memory.  The  two  will  often 
fail  to  match  exactly,  due  to  changes  in  size,  position,  viewing  direction, 
etc.  A  common  view  is  therefore  that  prior  to  the  matching  the  input  is 
processed through a sequence of stages that includes, for example, edge
detection, the extraction of features of varying complexity, and normalization
for size, position, and orientation. The model below modifies this general view in
two directions. First, it proposes a bidirectional search, where the matching
can occur at intermediate levels rather than some "topmost" level. Second,
rather  than  following  a  single  path,  multiple  processing  alternatives  are 
explored  in  parallel. 

Bidirectional  Search 

In applying a sequence of transformations to match an incoming pattern
P with stored patterns M_i, the transformations could be applied to P, or to
M_i, or to both. A simple transformation, such as overall shift or scaling,
is  best  applied  to  the  input  pattern,  because  then  it  will  be  applied  to  a 
single  pattern.  Other  transformations  are  specific  to  a  stored  model  (e.g., 
how  a  face  may  transform  by  facial  expressions)  and  cannot  be  applied  to 
the  image  in  a  "bottom-up"  manner.  An  attractive  solution  is  to  apply  a 
bidirectional  search,  which  is  also  economical  in  terms  of  the  number  of 
patterns  explored.  More  generally,  the  suggestion  is  to  use  two  streams 
of  processing,  an  ascending  one  starting  at  the  input,  and  a  descending 
one  starting  at  the  stored  models.  From  a  biological  standpoint,  these  will 
correspond  to  the  "forward"  and  "backward"  connections  between  cortical 
areas. 
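
The economy of a bidirectional search can be illustrated with a simple count. The sketch below assumes, purely for illustration, a uniform network in which every pattern activates b successors and in which source and target are d steps apart; the uniform branching factor and the function names are assumptions and are not part of the model.

```python
def patterns_one_way(b: int, d: int) -> int:
    """Patterns reached when d levels are expanded from one end only."""
    return sum(b ** k for k in range(1, d + 1))


def patterns_two_way(b: int, d: int) -> int:
    """Patterns reached when both ends expand until they meet midway."""
    up = d // 2      # levels expanded by the ascending stream
    down = d - up    # levels expanded by the descending stream
    return patterns_one_way(b, up) + patterns_one_way(b, down)


# Hypothetical values: 3 successors per pattern, 6 steps between the ends.
one_way = patterns_one_way(3, 6)   # 3 + 9 + 27 + 81 + 243 + 729
two_way = patterns_two_way(3, 6)   # 2 * (3 + 9 + 27)
```

Under these assumptions the one-way expansion reaches 1092 patterns and the two-way expansion 78, which is the sense in which the bidirectional search is economical.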

Exploring  Multiple  Alternatives 

A  large  number  of  alternative  routes  may  have  to  be  explored  before  a  link  is 
successfully  established  between  a  "source"  and  a  "target"  representation. 
To  achieve  fast  computation,  it  will  be  necessary  to  explore  simultaneously 
a  large  number  of  alternative  routes.  In  many  models  of  visual  process¬ 
ing,  the  input  pattern  undergoes  a  single  sequence  of  processing  stages. 
In  contrast,  in  the  sequence-seeking  scheme  an  input  pattern  gives  rise  to 
multiple  sequences  of  transformations  and  mappings  that  are  explored  in 
parallel.  The  terms  "transformations"  and  "mappings"  should  be  taken 
here  in  a  broad  sense;  they  may  include  geometric  transformations  such  as 
changes  in  size,  position,  and  orientation,  the  recovery  of  different  prop¬ 
erties  such  as  color,  motion,  texture,  and  3D  shape,  as  well  as  exploring 
alternative  ways  of  representing  the  pattern  (e.g.,  in  terms  of  its  parts  and 
its  abstract  shape  properties). 

Linking  the  Ascending  and  Descending  Streams,  the  Counterstreams 
Structure 

The  bidirectional  processing  is  diagrammed  schematically  in  figure  12.1a. 
The  basic  operation  in  this  scheme  is  to  seek  a  sequence  of  processing  steps 
linking a source pattern (S in figure 12.1a) in one area with stored representations
(such as M1, M2) in another. The nodes in this schematic figure
represent patterns of activity (e.g., subpopulations of neurons acting together,
possibly with some degree of synchrony) (Abeles 1991; Engel et al.
1992), and the arrows indicate how patterns activate subsequent patterns
(e.g., S can activate A2, A3, and A5). Since different patterns may share
neurons, implementation constraints will place some limitations on the
coactivation of patterns; for example, patterns (B2, B3, B4) may be prohibited
from being coactive. In expanding the sequence down from M1, only
a  subset  of  these  patterns  will  be  activated  initially,  and  will  later  decay 
and  be  replaced  by  others. 

The  search  is  bidirectional,  and  a  linking  sequence  is  successfully  estab¬ 
lished  when  the  two  searches  meet  somewhere  in  this  large  network  of 
interconnected  patterns.  How  can  a  successful  link  of  patterns  be  found 
by  the  system?  The  proposed  scheme  (figure  12.1b)  has  two  main  compo¬ 
nents.  First,  the  ascending  and  descending  streams  proceed  along  separate, 
complementary  pathways.  Second,  when  a  track  is  being  traversed  in  one 
stream,  it  is  assumed  to  leave  behind  a  primed  trace  in  the  complementary 
stream,  making  it  more  readily  excitable,  as  explained  further  below.  The 
scheme  shown  schematically  in  figure  12.1b  is  similar  to  that  shown  in  fig¬ 
ure  12.1a  except  that  each  node  is  now  split  into  two  complementary  nodes 
(e.g., B2 in figure 12.1a is now split into B2 on the ascending pathway and
its complementary pattern B̄2 on the descending one).

The full bidirectional search now proceeds as follows. A number of
sequences originating at S begin to be activated along the ascending pathway.
At the same time, sequences originating at M1 and M2 begin to
expand downward along the descending pathway. Not all of the possible
sequences are expanded simultaneously, but whenever a track (subsequence)
is being traversed, the complementary track remains in a primed
state, ready to be activated. Suppose that by the time S has activated A2
along the ascending stream, the track M1 → B3 → A2 had already been
traversed in the descending stream. Due to the primed traces, this will
result in the immediate activation of the complete sequences S → M1 and
M1 → S, establishing a complete link between the source and target patterns.
This will also select M1 as the stored pattern corresponding to the
input image S. (A selection among models may be required if more than a
single model is matched with the input.)

Two  properties  of  this  linking  process  are  worth  noting.  First,  a  link 
between  the  ascending  and  descending  streams  can  take  place  at  any  level. 
Second,  to  establish  a  link,  the  ascending  and  descending  patterns  need  not 
arrive  at  a  given  node  simultaneously;  a  meeting  is  also  possible  between 
an  active  pattern  and  a  pattern  that  had  been  active  some  time  before  and 
decayed,  but  left  a  primed  trace  in  the  complementary  stream. 
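
The linking process can be made concrete with a small simulation. The following is a minimal sketch, not the model's actual implementation: time is discretized, a primed trace is represented simply by the time at which the other stream traversed a pattern, and the trace is taken to have decayed after `trace_lifetime` steps. Only the connections S → A2, A3, A5 and the track M1 → B3 → A2 come from the text; every other connection, and all names and parameters, are hypothetical.

```python
# Ascending connections between patterns (mostly hypothetical, see above).
ASCENDING = {
    "S": ["A2", "A3", "A5"],
    "A1": ["B2"], "A2": ["B3"], "A4": ["B4"], "A6": ["B5"],
    "B2": ["M1"], "B3": ["M1"], "B4": ["M1"], "B5": ["M2"],
}

# Reciprocity: every ascending connection has a descending back-connection.
DESCENDING = {}
for src, dsts in ASCENDING.items():
    for dst in dsts:
        DESCENDING.setdefault(dst, []).append(src)


def _link(node, parent):
    """Join the ascending and descending tracks that met at `node`."""
    lower, n = [], node
    while n is not None:
        lower.append(n)
        n = parent["up"][n]
    upper, n = [], parent["down"][node]
    while n is not None:
        upper.append(n)
        n = parent["down"][n]
    return lower[::-1] + upper


def seek_sequence(source, models, trace_lifetime=2,
                  up_start=0, down_start=0, max_steps=10):
    """Return a linking sequence from source to a model, or None."""
    graph = {"up": ASCENDING, "down": DESCENDING}
    other = {"up": "down", "down": "up"}
    start = {"up": up_start, "down": down_start}
    front = {"up": [source], "down": list(models)}
    parent = {"up": {source: None}, "down": {m: None for m in models}}
    traversed_at = {"up": {}, "down": {}}  # pattern -> time of traversal

    for t in range(max_steps):
        for stream in ("up", "down"):
            if t < start[stream]:
                continue
            for node in front[stream]:
                traversed_at[stream][node] = t
                # A traversal in the other stream left a primed trace here.
                trace = traversed_at[other[stream]].get(node)
                if trace is not None and t - trace <= trace_lifetime:
                    return _link(node, parent)
            nxt = []
            for node in front[stream]:
                for succ in graph[stream].get(node, []):
                    if succ not in parent[stream]:
                        parent[stream][succ] = node
                        nxt.append(succ)
            front[stream] = nxt
    return None
```

With both streams starting together, `seek_sequence("S", ["M1", "M2"])` meets at B3; with `up_start=2` the descending track has already passed A2 when the ascending stream arrives, and the link is still established through the primed trace. With `trace_lifetime=0` and `up_start=5` the traces have decayed and no link is found.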

In terms of connectivity, the excitatory connections between patterns are
reciprocal, obeying the following general rule: whenever A is connected
to B, there is a back-connection from B̄ to Ā, with cross-connections (of
the priming type) between A and Ā and between B and B̄ (figure 12.1c). The
reciprocity of the connections is an inherent aspect of the model, and it is also
a distinguishing feature of cortical connectivity. Note that although the
counterstreams structure uses "forward" and "backward" connections, it
does not necessarily imply a hierarchical structure; it can incorporate a
more general structure as long as the above connectivity rule is obeyed.
(Inhibitory connections also play a role, but will not be discussed.)

Figure 12.1 (a) The sequence-seeking computation seeks a sequence of mappings linking a
source pattern (S) in one area with stored representations (M1, M2) in another. Nodes represent
patterns of activity and arrows indicate how patterns activate subsequent patterns.
In expanding sequences only a subset of patterns will be activated initially, and will later
decay and be replaced by others. The processing is bidirectional, and a linking sequence
is successfully established when the two searches meet somewhere in this large network of
interconnected patterns. (b) Similar to (a), except that each node is split into two complementary
ones. The ascending and descending streams proceed along complementary pathways.
When a track is being traversed in one stream, it leaves behind a primed trace in the complementary
stream. (c) The basic unit of the counterstreams structure. Patterns A, B on the
ascending, Ā, B̄ on the descending path. Horizontal arrows denote connections of the priming
type. This repeating unit is embedded in a network of richly interconnected patterns.

The  basic  design  of  the  sequence-seeking  model  is  relatively  straightfor¬ 
ward,  comprising  two  complementary  networks  going  in  opposite  direc¬ 
tions,  with  interaction  between  them  primarily  (but  not  exclusively)  in  the 
form  of  enhancing  patterns  across  the  two  streams.  Compared  with  other 
models,  the  scheme  places  more  emphasis  on  the  parallel  exploration  and 
selection of multiple alternatives, rather than relaxation and iterative computations.
Timing considerations (Maunsell and Gibson 1992; Thorpe et al.
1991; Rolls et al. 1991) appear to place rather stringent restrictions on the
use of multi-iteration relaxation processes in tasks such as visual recognition.
A visual cortical area may introduce an average delay of about 10-15
msec, and there are about six stations spanning the hierarchy from V1 to
anterior  IT.  This  suggests  that  visual  processing  should  usually  require  a 
limited  number  of  sweeps  through  the  system.  It  is  desirable,  therefore, 
especially  for  a  highly  parallel  system,  to  explore  multiple  alternatives 
simultaneously,  rather  than  explore  and  refine  them  in  sequence. 
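
The timing argument amounts to a short calculation. The sketch below uses the figures quoted above (about six stations, 10-15 msec per area); the 200 msec time budget in the usage note is a hypothetical value chosen only to illustrate the arithmetic, and the function names are assumptions.

```python
STATIONS = 6          # stations from V1 to anterior IT (from the text)
DELAY_MS = (10, 15)   # average delay per cortical area (from the text)


def sweep_duration_ms(stations=STATIONS, delay_ms=DELAY_MS):
    """Shortest and longest time for one sweep through the hierarchy."""
    return stations * delay_ms[0], stations * delay_ms[1]


def sweeps_within(budget_ms, stations=STATIONS, delay_ms=DELAY_MS):
    """Fewest and most complete sweeps that fit into a time budget."""
    fastest, slowest = sweep_duration_ms(stations, delay_ms)
    return budget_ms // slowest, budget_ms // fastest
```

A single sweep takes 60-90 msec, so a hypothetical budget of 200 msec leaves room for only two or three sweeps, far fewer than a multi-iteration relaxation process would typically need.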

This is the skeleton of the computation; a number of elaborations and
properties of the basic process are discussed below.

Express  Lines 

In  expanding  the  descending  sequences,  how  can  the  initial  selection  of 
models  be  performed?  To  cut  down  the  number  of  competing  sequences 
in  the  descending  stream,  it  would  be  useful  to  expand  with  higher  prior¬ 
ity  alternatives  that  appear  more  promising.  For  example,  in  attempting 
to  recognize  an  object,  some  models  (a  face,  say)  may  become  more  likely 
than  others  on  the  basis  of  partial  analysis,  although  it  may  not  yet  be  pos¬ 
sible  to  identify  the  individual  face.  It  would  be  advantageous  under  these 
circumstances  to  expand  many  face-related  sequences,  possibly  at  the  ex¬ 
pense  of  others.  Such  an  effect  can  be  obtained  by  using  "express  lines," 
directly  connecting  low-level  to  higher-level  nodes  along  the  ascending 
stream.  The  express  lines  will  activate  (directly  or  indirectly)  patterns  on 
the  descending  path.  This  will  initiate  an  expansion  of  sequences  from 
the  selected  patterns.  This  selection  of  higher-level  patterns  can  be  viewed 
as  invoking  a  hypothesis  suggested  by  the  data,  but  which  has  yet  to  be 
confirmed.  A  link  to  the  ascending  stream  will  still  be  required  to  confirm 
the  hypothesis.  Note  that,  unlike  the  priming  interaction  between  streams, 
in  the  case  of  express  lines  the  ascending  stream  can  directly  activate  the 
descending  one.  Express  lines  could  also  use  inhibition  rather  than  facili¬ 
tation;  if  the  partially  expanded  sequences  in  the  ascending  stream  render 
some  higher-level  nodes  unlikely,  inhibitory  "express  lines"  could  be  used 
to suppress their expansion in the descending stream.

The express lines provide one mechanism for "indexing" into the large
number  of  models  stored  in  memory.  "Indexing"  is  a  term  used  in  com¬ 
putational  vision  for  the  initial  selection  of  a  general  class,  or  classes  of 
models,  that  might  correspond  to  the  input  image.  The  express  lines  play 
a  role  in  this  process  by  the  selection  of  likely  models  on  the  descending 
stream.  This  initial  selection  is  not  limited  to  the  activation  of  models  at 
a  single  "topmost"  level;  models  at  different  levels  along  the  descending 
stream  can  also  be  indexed  and  serve  as  the  starting  point  for  descending 
subsequences.  For  example,  in  addition  to  the  selection  of  a  complete  face 
model,  intermediate  models  of  face  parts  can  also  be  activated.  Anatomi¬ 
cally,  such  express  lines  may  correspond  to  direct  connections  from  low  to 
high  visual  areas  (such  as  the  connections  from  area  V4  to  AIT,  or  from  V3 
and  VP  to  area  TF;  Felleman  and  Van  Essen  1991). 
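
Indexing by express lines can be pictured as a weighted vote for candidate models. The following is a rough sketch under strong simplifying assumptions: the cue names, model names, and weights are all invented for illustration, positive weights stand for facilitating express lines, and negative weights stand for inhibitory ones.

```python
# Hypothetical express lines: low-level cue -> [(model, weight), ...].
EXPRESS_LINES = {
    "oval_outline": [("face", 1.0), ("car", -0.5)],
    "two_dark_blobs": [("face", 0.8), ("face_part:eyes", 1.0)],
    "wheel_like_circle": [("car", 1.0)],
}


def index_models(cues, top_k=2):
    """Select the models whose descending sequences are expanded first."""
    scores = {}
    for cue in cues:
        for model, weight in EXPRESS_LINES.get(cue, []):
            scores[model] = scores.get(model, 0.0) + weight
    # Models with a net inhibitory score are suppressed altogether.
    candidates = [(score, model) for model, score in scores.items() if score > 0]
    return [model for score, model in sorted(candidates, reverse=True)[:top_k]]
```

Here `index_models(["oval_outline", "two_dark_blobs"])` selects the complete face model together with an intermediate face-part model, while the inhibited car model is not expanded. The selected models remain hypotheses that still require a link to the ascending stream for confirmation.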

Another  mechanism  for  model  selection  is  provided  by  the  effects  of 
expectation  and  context.  Knowledge  about  the  current  situation  can  lead 
to  the  activation  or  priming  of  a  subset  of  models  that  will  then  become 
preferential  sources  for  descending  sequences.  The  set  of  active  models  will 
then  be  modified  and  refined  throughout  the  sequence-seeking  process,  as 
described  below. 

The  Effect  of  Context 

Context  can  have  a  powerful  influence  on  the  processing  of  visual  infor¬ 
mation  (as  well  as  in  other  perceptual  and  cognitive  domains).  A  pair  of 
similar,  elongated  blobs  in  the  image  may  be  ambiguous,  but  in  the  appro¬ 
priate  context  (e.g.,  under  the  bed)  they  may  be  immediately  recognized 
as  a  pair  of  slippers. 

Context  effects  can  operate  in  the  framework  of  the  sequence-seeking 
scheme  by  a  prior  priming  of  some  of  the  nodes.  The  effect  will  be  similar 
to  the  mutual  priming  of  the  ascending  and  descending  streams  but  with 
longer time scales. (Priming between the streams may last for tens to
hundreds of milliseconds, whereas context effects should last considerably longer,
up to minutes or hours.) Sequences passing through the primed nodes will
then  become  facilitated.  In  the  above  example,  the  location  of  the  blobs 
under  the  bed  will  prime  patterns  representing  objects  that  are  commonly 
found  in  that  location,  making  slippers  a  likely  interpretation. 

The  general  notion  of  priming  internal  representations  is  a  common  one, 
but  its  effects  in  the  framework  of  the  sequence-seeking  scheme  are  par¬ 
ticularly  broad.  When  certain  nodes  are  activated  (e.g.,  by  noticing  and 
identifying  the  bed  in  the  image)  they  will  initiate  sequences  of  their  own, 
and an entire set of patterns may end up in a primed state. (Context may
also involve inhibitory effects, making some of the paths less favorable.)
Later on, a large number of possible sequences passing through a
primed node will be facilitated, compared with the nonprimed sequences.
Context effects are thus not limited to directly increasing or decreasing the
likelihood of a single match, but they can have indirect, widespread effects
by facilitating otherwise less favorable sequences. A context pattern A may
help  to  bring  about  the  activation  of  B,  not  as  a  result  of  direct  association, 
but  because  A  may  have  a  sequence  leading  to  some  intermediate  pattern 
C,  and,  later  on,  an  activated  pattern  may  have  another  sequence  leading 
to  B  via  the  primed  pattern  C. 
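
The indirect effect described here can be sketched in a few lines. In the toy network below, the patterns A, B, and C play the roles given in the text; the input pattern X, the competing patterns D and E, and all function names are hypothetical, and priming is reduced to simple set membership with no decay.

```python
# Hypothetical pattern network: A is the context pattern, X a later input.
LINKS = {
    "A": ["C"],
    "X": ["D", "C"],
    "C": ["B"],
    "D": ["E"],
}


def prime_from(context, depth=1):
    """Patterns left primed by sequences started at `context`."""
    primed, front = set(), [context]
    for _ in range(depth):
        front = [nxt for node in front for nxt in LINKS.get(node, [])]
        primed.update(front)
    return primed


def sequences_from(start):
    """All sequences from `start` to a pattern with no successors."""
    successors = LINKS.get(start, [])
    if not successors:
        return [[start]]
    return [[start] + rest for s in successors for rest in sequences_from(s)]


def rank_sequences(start, primed):
    """Explore sequences passing through primed patterns first."""
    return sorted(sequences_from(start),
                  key=lambda seq: -sum(node in primed for node in seq))
```

Without context, X → D → E is explored first. After `prime_from("A")` has primed C, the sequence X → C → B takes priority, although A and B are not directly associated.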

These  general  characteristics  capture  some  of  the  fundamental  aspects 
of  context  effects  in  humans.  Humans'  perception  and  cognition  appear  to 
have  an  almost  uncanny  capacity  (which  is  extremely  difficult  to  reproduce 
in  artificial  systems)  for  bringing  in  relevant  context  information  in  a  broad 
and  flexible  manner.  It  seems  that  broad,  indirect,  context  effects  can  be 
reproduced  by  the  sequence-seeking  computation. 

Learning  Sequences 

A  simple  and  local  learning  rule  is  sufficient  in  the  counterstreams  struc¬ 
ture  to  reinforce  selectively  complete  successful  sequences.  Every  pattern 
node  in  a  successful  sequence  will  receive  both  a  direct  activation  and  a 
priming  signal  from  the  complementary  track,  while  patterns  on  dead-end 
tracks  will  receive  one  or  the  other  but  not  both.  The  approximate  temporal 
coincidence  of  the  two  signals  can  be  used  to  preferentially  strengthen  the 
successful  sequence.  This  rule  is  local,  since  it  depends  on  the  activation 
of  a  single  pattern.  Yet,  it  is  sufficient  to  reinforce  preferentially  successful 
sequences forming an uninterrupted link between source and target patterns.
Following practice, out of the huge number of possible sequences,
those that proved useful in the past will be explored with higher priority
in future use of the network.
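
A local rule of this kind can be sketched as follows. The representation is an assumption made for illustration: each pattern has a single strength value, the two signals are recorded as arrival times, and the saturating update toward 1.0 is one arbitrary choice among many; none of these details are specified by the model.

```python
def reinforce(weights, activation_time, priming_time, window=2, rate=0.1):
    """Strengthen patterns that received both signals close together in time.

    weights:         pattern -> current strength in [0, 1]
    activation_time: pattern -> time of direct activation (if any)
    priming_time:    pattern -> time of priming from the complementary track
    """
    updated = dict(weights)
    for node, w in weights.items():
        t_act = activation_time.get(node)
        t_prime = priming_time.get(node)
        if t_act is None or t_prime is None:
            continue  # dead-end track: one signal or the other, not both
        if abs(t_act - t_prime) <= window:
            updated[node] = w + rate * (1.0 - w)
    return updated


# Hypothetical trial: A2 and B3 lie on the successful sequence.
weights = {"A2": 0.5, "A3": 0.5, "B3": 0.5, "B4": 0.5}
after = reinforce(weights,
                  activation_time={"A2": 1, "A3": 1, "B3": 2},
                  priming_time={"A2": 2, "B3": 1, "B4": 1})
```

Only A2 and B3 are strengthened; A3 (activated but not primed) and B4 (primed but not activated) keep their previous strength. Each update depends only on signals arriving at the pattern itself, which is what makes the rule local.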

In  the  process  of  reinforcing  successful  sequences,  changes  due  to  learn¬ 
ing  are  distributed  throughout  the  system,  and  are  not  confined  to  high 
level  centers  specializing  in  learning  (Sejnowski  1986).  Recent  studies  of 
learning  certain  perceptual  skills  suggest  that  low  level  visual  areas,  in¬ 
cluding  primary  visual  cortex,  are  indeed  involved  in  the  modifications 
that take place during the learning process (Karni and Sagi 1991).

In  addition  to  the  learning  of  complete  sequences,  as  above,  the  system 
may  also  be  engaged  in  the  learning  of  the  individual  mappings,  that  is, 
the  basic  steps  that  make  up  the  sequences  (Poggio  1990).  This  aspect  of 
the  learning  is,  however,  outside  the  scope  of  the  current  discussion,  since 
the  focus  is  not  on  the  specifics  of  individual  processes,  but  on  their  overall 
common  structure. 

Refining  the  Expansion 

The  matching  between  streams  is  not  an  all-or-nothing  event,  but  a  graded 
one:  some  sequences  will  lead  to  better  matches  than  others,  and  then 
serve  as  starting  points  for  exploring  additional  sequences,  that  will  lead 
in  turn  to  an  improved  match.  This  process  has  some  features  in  common 
with a family of optimization and search procedures known as "genetic
algorithms" (Holland 1975). Recent evaluations have shown such methods
to  behave  quite  efficiently  (Brady  1985;  Peterson  1990).  Our  own  simula¬ 
tions  in  the  context  of  pattern  matching  have  also  shown  that  computations 
based  on  sequence-seeking  compare  favorably  with  alternative  methods, 
such  as  gradient  descent  and  simulated  annealing. 

General  Aspects  of  Sequence  Seeking 

The discussion above of the sequence-seeking process used as an example
the domain of visual recognition. However, the process of establishing a
sequence  of  transformations,  mappings,  or  states,  linking  source  and  tar¬ 
get  representations,  could  provide  a  useful  general  mechanism  for  various 
aspects  of  perception  as  well  as  for  nonperceptual  functions.  For  exam¬ 
ple,  the  planning  of  a  motor  action  can  be  cast  at  some  level  in  terms  of 
seeking  a  sequence  of  possible  moves  linking  an  initial  configuration  with 
a desired final state. In a sequence-seeking scheme, movement trajectories
will be based on a stored repertoire of elementary movements. These
basic movements could then be transformed (scaled, stretched, rotated,
etc.) and concatenated together to generate more complex movements. In
analogy with sequence seeking in vision, motion planning will involve the
application of transformations and the generation of compound sequences.
Similarly,  more  general  planning  and  problem  solving  can  also  often  be  de¬ 
scribed  in  terms  of  establishing  a  sequence  of  transformations,  mappings, 
or  intermediate  states,  linking  some  source  and  target  representations.  The 
general  aspects  of  the  sequence-seeking  process  therefore  provide  a  useful 
computation  that  could  be  applied,  with  appropriate  modification,  to  a 
large  variety  of  different  tasks. 

BIOLOGICAL  EMBODIMENT 

The  sequence-seeking  model  requires  two  streams  going  in  opposite  di¬ 
rections  with  the  appropriate  cross-connections.  A  schematic  diagram 
proposing  how  the  counterstreams  structure  may  be  embedded  in  cor¬ 
tical  connections  is  shown  in  figure  12.2a.  The  proposed  implementation 
is  presented  in  schematic  outline  only,  focusing  on  a  number  of  central 
aspects, and without discussing details or possible variations of the model.

The  ascending  stream  goes  through  layer  4  to  a  subpopulation  of  the 
superficial  layers,  denoted  in  the  figure  as  AS  (for  ascending  superficial), 
and  then  projects  to  layer  4  of  the  next  cortical  area  (II  in  the  figure).  The 
descending  stream  goes  through  a  different  subpopulation  of  the  superfi¬ 
cial  layers  (DS,  for  descending  superficial)  to  DI  (for  descending  infra),  a 
subpopulation  of  the  infragranular  layers  (often  in  layer  6),  and  from  there 
to DS of a preceding area. The connections can also leap over one step (or
occasionally more) in the stream (e.g., AS directly to AS on the ascending
stream, and DS → DS or 6 → 6 on the descending stream) (thin lines in
figure 12.2b).

Layer 5 is left out of the diagram because, according to the model, it (or
a part of it) is involved primarily not in the main streams, but in their
control, via subcortical structures. There are two reasons for assuming that
layer 5 (or parts of it, e.g., 5b of the macaque's V1) may be involved in control
functions. First, it has orderly connections to subcortical structures (e.g.,
from visual cortex to the pulvinar and the superior colliculus, structures
implicated in controlling attention and eye movements) that are in turn
reciprocally connected in a topographic manner to multiple visual areas.
Second, a population of pyramidal cells in this layer has a firing pattern
that "can initiate synchronized rhythms and project them on neurons in all
layers" (Silva et al. 1991).

Note  that  the  counterstreams  structure  suggests  a  natural  organization 
in  about  5-6  main  layers:  one  or  two  performing  control  functions,  two  (an 
input  and  an  output  layer)  for  the  ascending  and  two  for  the  descending 
streams. The division between the roles of the different layers may in reality
be less clear-cut; however, the main goal of the diagram is to emphasize
the common underlying structure according to the model, rather than to
account for possible variations.
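
The laminar routes described in this section can be written down as a small connection table and checked against the reciprocity rule of the counterstreams structure. The following is a minimal sketch for two successive areas, I and II. The pairing of layer 4 with DI and of AS with DS as complementary populations is an assumption made here for illustration, as are all names in the code.

```python
# Each connection is ((area, population), (area, population)).
ASCENDING = [
    (("I", "4"), ("I", "AS")),
    (("I", "AS"), ("II", "4")),
    (("II", "4"), ("II", "AS")),
]
DESCENDING = [
    (("II", "DS"), ("II", "DI")),
    (("II", "DI"), ("I", "DS")),
    (("I", "DS"), ("I", "DI")),
]

# Assumed complementary populations within an area.
COMPLEMENT = {"4": "DI", "AS": "DS"}


def complement(node):
    """Complementary population of an ascending population."""
    area, population = node
    return (area, COMPLEMENT[population])


def obeys_reciprocity(ascending, descending):
    """Check that every ascending A -> B has a back-connection B' -> A'."""
    back = set(descending)
    return all((complement(b), complement(a)) in back for a, b in ascending)
```

With the tables above, `obeys_reciprocity(ASCENDING, DESCENDING)` holds; removing any one descending connection breaks the rule, which illustrates how the rule constrains the laminar wiring.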

Connections of V1: Data and Predictions

Figure 12.2b, c shows an expanded version of the diagram, applied to cortical
area V1 (which is somewhat special, but for which the data are more
comprehensive than for other visual areas), and its connections to the LGN
and cortical area V2 (V1 is also connected to other visual areas, not shown
in the diagram). Figure 12.2b shows the connections in the macaque of the
magnocellular stream and figure 12.2c of the parvocellular stream (Rockland
and Lund 1983; Lund 1988a, b; Martin 1988). The diagram shows the
main connections; additional secondary ones will not be considered. The
connections are drawn in a manner suggested by the model, and they include
both known connections (thick arrows) and connections predicted
by the proposed scheme but for which empirical evidence is partial or
lacking (thin arrows).

The pattern of connections in the two streams appears to be in general
agreement with the counterstreams structure and figure 12.2a. This
structure can be used as a repeating circuit to utilize the cortex's inherent
parallelism and combine ascending and descending information flows. If
the general hypothesis regarding the counterstreams structure is broadly
correct, then a number of predictions can be made regarding the main
connectivity patterns within and between areas. One general prediction is
the possible distinction between the AS and DS subpopulations. This separation
reflects the most straightforward implementation of the scheme;
however, some alternatives can exist without violating the constraints of
the model.

A separation between the ascending and descending populations is evident
in the connections involving layer 4: the ascending streams terminate
in layer 4, whereas the descending streams always avoid it. In the superficial
layers the situation is more difficult to assess, and the available evidence
is at present limited. In the magnocellular projection from V1 to V2
the forward projection originates mainly in 4B, while the back projection
is mainly to other layers (2b). It is further expected that even when
the superficial layers provide both the source and the target of connections
to another area, there will in fact often be a separation into the AS/DS
subpopulations (presumably a vertical, rather than horizontal separation;
Wong-Riley 1978; Zeki and Shipp 1988). If these populations exist, they
should be connected in a reciprocal manner. A related expectation derived
from the model is the existence of priming-type synaptic interactions, that
is, excitatory synaptic input that by itself may not be very effective in driving
the target cells, but that facilitates the effects of subsequent inputs to
these cells.

Figure 12.2 (a) How the basic counterstream structure may be embodied in cortical connectivity.
The structure contains two interconnected streams, an ascending and a descending
one. The ascending path goes through layer 4 and the ascending superficial population (AS)
to the next area. The descending path goes from the descending superficial (DS) population
to DI (descending infra) back to the first area. Thin arrows show pathways that "leap over" a
step in the stream. Inhibitory and long-range intraareal connections are not shown. See text
for more details. (b) The main connections according to the model along the magno stream
from the LGN via V1 to V2. V1 is also connected to other visual areas, not shown in the
diagram. The connections are drawn in a manner suggested by the model and (a). Thick
arrows, established connections; thin arrows, connections predicted by the model. (c) The
main connections according to the model along the parvo stream from the LGN via V1 to V2.
Thick arrows, established connections; thin arrows, connections predicted by the model.

An  example  of  a  more  specific  prediction  is  that  in  the  magnocellular 
stream  the  model  suggests  reciprocal  interconnections  between  layer  4B 
(playing  the  part  of  AS  in  the  model),  and  layers  1-3,  the  recipients  of 
descending  projections  from  V2  (DS  in  the  model).  The  projection  from  4B 
to  the  superficial  layers  is  well  established.  It  is  also  known  (Lund  1988a) 
that  4B  pyramidal  cells  send  apical  dendrites  to  the  superficial  layers  where 
the  connection  may  take  place.  It  has  been  noted  in  this  regard  (Lund 
1988a)  that  the  significant  4B  projection  has  a  surprisingly  limited  effect 
on properties of superficial-layer units (such as directional selectivity).
In  figure  12.2b,  this  connection  has  mainly  a  priming  role,  and  therefore 
the  lack  of  direct  effect  is  not  unexpected.  A  related  prediction  is  the 
expectation  that  the  same  superficial  cells  connected  to  4B  will  also  be  the 
recipients  of  descending  projections  from  V2. 

The  model  also  includes  a  reciprocal  connection  between  layer  4  and 
the  LGN-projecting  cells  in  layer  6.  The  projection  from  6  to  4  is  well- 
established  in  both  cat  (McGuire  et  al.  1984)  and  monkey  (Lund  1988a), 
and  there  is  support  for  the  opposite  connection  as  well  (Lund  and  Boothe 
1975).  It  is  also  interesting  to  note  in  this  regard  that  the  population  of 
layer  6  cells  projecting  back  to  the  LGN  was  found  (in  the  cat)  to  be  the 
same  cells  that  are  also  connected  to  layer  4C,  by  axonal  collaterals  and 
dendritic  arbors  (Katz  et  al.  1984),  in  accordance  with  the  connectivity  in 
figure  12.2b,c. 

The  connections  between  layers  4  and  6  are  expected  in  the  model  to 
have  a  priming  effect  (not  necessarily  the  only  effect,  see  Martin  1988), 
and this notion has some physiological support. It was found (Ferster
and Lindstrom 1985) that antidromic electrical activation of layer 6 cells
increased the probability of layer 4 firing, with most
cells firing multiple spikes in response to each stimulus. Under the opposite
conditions, when layer 6 is inactivated, the main observed effect was a
reduction in the excitability of layer 4 cells (Grieve et al. 1991).

From  an  anatomical  standpoint,  EM  reconstructions  (McGuire  et  al. 
1984)  have  shown  terminations  of  layer  6  axons  on  smooth  and  sparsely 
spiny cells. The model suggests also a projection onto layer 4 spiny cells,
and  this  remains  to  be  clarified  in  future  anatomical  studies. 

Lateral  Connections  between  Areas 

Connections  between  cortical  areas  (not  only  visual,  also  somatosensory 
and  motor)  can  be  classified  into  "forward,"  "backward,"  and  "lateral" 
connections,  on  the  basis  of  the  laminar  distribution  of  their  source  and 
destination  (Rockland  and  Pandya  1979;  Maunsell  and  Van  Essen  1983; 
Friedman  1983;  Van  Essen  1985;  Zeki  and  Shipp  1988;  Andersen  et  al. 
1990; Boussaud et al. 1990; Felleman and Van Essen 1991). Lateral connections
terminate in all layers, and their origin is bilaminar, from the supragranular
and infragranular layers. The lateral pattern is relatively complex and sometimes
perplexing  (Felleman  and  Van  Essen  1991).  It  is  therefore  interesting  that  a 
number  of  its  main  features  can  be  derived  almost  directly  from  the  model. 
The counterstreams structure does not have a distinct, third type of
connection, but it allows forward and backward connections simultaneously
in both directions, and it can include lateral connections by assuming that
they are the "union" of ascending and descending connections. If this view
is correct, then the main connections participating in the lateral connection
can be inferred from the basic scheme (figure 12.2a). According to the
model, they include the direct connections AS → 4 and 6 → DS, as well
as the connections that leap over one stage in the diagram, namely, AS →
AS, DS → DS, DI, and DI → DI.

The  origin  of  the  projections  according  to  the  model  would  be  bilaminar, 
and  the  terminations  would  span  all  layers,  in  agreement  with  the  observed 
pattern.  This  can  also  explain  several  difficulties  such  as  the  problem  of 
irregular terminations (Felleman and Van Essen 1991), which occurs, e.g.,
when some of the terminations are restricted to layer 4 of the target area
while others show columnar terminations. This was termed F/C (i.e., a
mixture of "four" and "columnar") paradoxical termination, since
termination in layer 4 is usually a signature for ascending connections, while a
columnar termination signifies lateral connections. Usually these connection
types are distinct, but some interconnections exhibit a mixed type. In
the counterstreams structure, the main point to note is that the lateral
connections from the superficial layers of area A to target area B are composed
of two subprojections: AS → 4 (ascending) and DS → DS, DI (descending).
(In addition, there is a descending connection DI → DI.) Anterograde
labeling of the upper layers of area A can therefore show a mixed pattern
of  terminations:  4  alone  (from  AS  of  A),  or  a  columnar  termination  (from 
AS  and  DS).  This  is  in  agreement  with  the  F/C  paradoxical  termination 
(Felleman  and  Van  Essen  1991).  It  can  also  (by  labeling  the  DS  alone)  show 
a  bilaminar  pattern  of  connections;  this  can  account  for  the  other  types  of 
irregular  terminations.  If  this  account  is  correct,  it  also  provides  support 
for  the  existence  of  the  separate  AS  and  DS  subpopulations. 
Possible Priming Mechanisms

Synaptic  interactions  in  the  model  include  priming-type  effects  between 
the  complementary  streams.  Although  this  has  not  been  studied  directly, 
some  known  or  physiologically  plausible  mechanisms  may  play  a  role  in 
such  priming  interactions. 

One  such  mechanism  has  been  investigated  by  Miller  et  al.  (1989).  In 
this  study,  responses  of  cells  in  the  cat's  visual  cortex  to  visual  stimulation 
were profoundly suppressed by the blocking of NMDA receptors (by using
APV).  A  possible  mechanism  proposed  by  Miller  et  al.  (1989)  by  which 
NMDA  receptors  could  control  the  responsiveness  of  cells  is  that  such 
receptors,  when  activated  in  neocortex  pyramidal  cells,  cause  a  slow,  long- 
lasting  EPSP  that  rises  to  a  peak  in  10-75  msec.  They  suggest  that  this 
slow  EPSP  could  provide  a  base  on  which  subsequent  subthreshold  input 
would  become  suprathreshold. 

Another mechanism has been proposed by Koch (1987) and is supported by
empirical studies in the LGN (Esguerra et al. 1989; Sherman et al. 1990). The
proposed  mechanism  makes  use  of  the  capacity  of  NMDA  receptors  to 
increase  the  cell's  response  in  a  nonlinear  fashion,  as  a  function  of  the 
depolarization  in  the  postsynaptic  cell.  The  proposal,  in  the  context  of 
the  LGN  (Koch  1987),  is  that  the  descending  stimulation  from  the  cortex 
can  cause  long-lasting  subthreshold  depolarization,  and  that  the  ascending 
stimulation  involves  receptors  of  the  NMDA  type.  If  ascending  stimulation 
arrives  while  the  units  are  still  in  a  depolarized  state,  the  response  will 
be  enhanced  significantly.  A  similar  mechanism  based  on  the  nonlinear 
properties  of  the  NMDA-type  receptor  could  be  used  for  priming  between 
the  streams. 

The  long-lasting  depolarization  could  be  contributed  by  postsynaptic 
responses  with  slow  time  course,  similar  to  the  persistent  Na+  channel, 
or the $I_T$ calcium channel (McCormick 1990). Synaptic mechanisms of this
type  have  been  implicated  in  cortical  cells  (Hirsch  and  Gilbert  1991;  Amitai 
et  al.  1993).  A  similar  effect  can  also  be  contributed  by  the  activation  of 
distal  parts  of  the  dendritic  tree.  Simulations  of  pyramidal  cells  (Stratford  et 
al.  1989)  have  shown  that  such  stimulation  can  have  a  significant  temporal 
extent. 

Priming  can  thus  be  obtained  by  long-lasting  depolarization,  caused  by 
the  properties  of  ionic  channels,  the  NMDA  receptors,  or  the  stimulation  of 
distal  dendritic  branches,  combined  with  subsequent  input,  added  either 
linearly  or  nonlinearly.  Other  mechanisms,  not  considered  here,  might 
participate  as  well.  Although  the  details  are  not  known,  it  appears  that 
synaptic mechanisms for priming connections are physiologically plausible,
and it will be of interest to try to test them empirically.
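
The priming idea summarized above can be illustrated with a minimal numerical sketch. It assumes the simplest case the text allows (linear summation on a leaky integrator) and uses arbitrary, hypothetical values for the time constant, input amplitudes, and threshold; none of these numbers come from the cited studies.

```python
import math

def peak_depolarization(inputs, tau_ms=40.0, dt_ms=1.0, steps=200):
    """Peak of a leaky integrator driven by brief inputs.

    `inputs` maps a time step to the depolarization added at that step.
    All parameters are illustrative assumptions, not measured values.
    """
    decay = math.exp(-dt_ms / tau_ms)  # passive decay per time step
    v = 0.0
    peak = 0.0
    for t in range(steps):
        v = v * decay + inputs.get(t, 0.0)
        peak = max(peak, v)
    return peak

THRESHOLD = 1.0  # arbitrary firing threshold

# A probe input that is subthreshold on its own.
probe_alone = peak_depolarization({50: 0.8})
# The same probe arriving 20 ms after a weaker priming input.
primed = peak_depolarization({30: 0.5, 50: 0.8})
# The same probe arriving after the priming depolarization has decayed.
too_late = peak_depolarization({30: 0.5, 190: 0.8})

print(probe_alone > THRESHOLD, primed > THRESHOLD, too_late > THRESHOLD)
```

In this sketch the probe crosses threshold only when it arrives while residual depolarization from the priming input is still present, so strict temporal coincidence is not required, but the effect fades as the depolarization decays.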
Effects of the Feedback Projection

According  to  the  sequence-seeking  scheme,  the  physiological  effects  of  the 
descending  projections  can  assume  two  different  forms:  either  the  priming 
and  modulation  of  the  ascending  stream  or  the  direct  activation  of  a  lower 
area. 

Both effects have been observed in physiological studies: modulatory
effects (Nault et al. 1990; Sandell and Schiller 1982) as well as direct excitatory
effects (Mignard and Malpeli 1991; Cauller and Kullics 1991b). Further
predictions  of  the  model  regarding  the  modulatory  effects  include  (1)  the 
facilitation  by  the  back  projections  will  not  require  strict  temporal  coin¬ 
cidence,  (2)  similar  modulatory  effects  are  also  likely  to  be  induced  by 
ascending  signals  on  descending  ones,  (3)  the  two  effects  may  be  segre¬ 
gated  into  two  distinct  subpopulations:  in  figure  12.1c  B  can  be  directly 
driven  along  the  descending  stream,  but  patterns  such  as  B  on  the  ascend¬ 
ing  stream  are  expected  to  show  modulatory  effects. 

In  summary,  the  computation  proposed  by  the  sequence-seeking  model 
is  a  bidirectional  search  performed  by  top-down  and  bottom-up  streams 
seeking  to  meet.  Key  properties  of  the  scheme  include  the  simultaneous 
exploration  of  multiple  alternatives,  the  relatively  simple,  uniform,  and 
extensible  structure,  the  flexible  use  of  "bottom-up"  and  "top-down"  se¬ 
quences  that  can  meet  at  any  level,  and  the  learning  of  complete  sequences 
by  a  simple  local  reinforcement  rule.  The  implementation  in  the  coun¬ 
terstream  structure  proposes  a  computational  account  for  several  basic 
features  of  cortical  circuitry,  such  as  the  predominantly  reciprocal  connec¬ 
tivity  between  cortical  areas,  the  forward,  backward,  and  lateral  connection 
types,  the  regularities  in  the  distribution  patterns  of  interarea  connections, 
the  organization  in  5-6  main  layers,  and  the  effects  of  back  projections. 
INTRODUCTION

Our  understanding  of  how  the  cerebral  neocortex  carries  out  its  essential 
functions  has  progressed  through  several  important  stages  over  the  past 
three decades. In the 1960s, Hubel and Wiesel provided the first detailed
characterization  of  orientation  selectivity,  direction  selectivity,  and  binocu¬ 
lar  interactions  in  visual  cortex.  Equally  important,  they  proposed  simple 
and  attractive  models  that  could  account  for  these  properties  by  invoking 
"bottom-up"  convergence  of  ascending  inputs  in  a  strictly  serial  processing 
hierarchy (Hubel and Wiesel 1965b).

In  the  1970s  these  ideas  were  modified  by  the  discovery  of  parallel  path¬ 
ways  within  the  visual  system  and  by  evidence  for  the  importance  of  intra- 
cortical  inhibition  and  other  aspects  of  local  intrinsic  circuitry  of  the  cortex 
(Stone  et  al.  1979;  Orban  1984).  This  picture  subsequently  became  enriched 
by  the  realization  that  intrinsic  connections  extend  rather  widely  within 
each  cortical  area  (Gilbert  1983)  and  also  that  there  are  massive  feedback 
pathways  projecting  from  higher  to  lower  cortical  areas  (Rockland  and 
Pandya  1979;  Maunsell  and  Van  Essen  1983).  The  function  of  these  long¬ 
distance  and  feedback  pathways  is  not  fully  understood,  but  it  is  widely 
presumed  that  they  subserve  the  strong  modulatory  effects  known  to  arise 
from  outside  the  classical  receptive  field  (Allman  et  al.  1985a;  Knierim 
and  Van  Essen  1992).  This  represents  a  form  of  "top-down"  processing 
that  makes  cortical  responses  to  sensory  stimuli  highly  dependent  on  the 
context  in  which  they  are  presented. 

Recently, we and others (Poggio 1984; Baron 1987; Anderson and Van Essen
1987, 1993; Van Essen and Anderson 1990; Olshausen et al. 1992, 1993)
have  argued  that  a  critical  component  is  missing  from  this  view  of  corti¬ 
cal  processing,  namely,  an  explicit  mechanism  for  dynamically  regulating 
the  flow  of  information  within  and  between  cortical  areas.  The  need  for 
such  a  mechanism  arises  from  computational  considerations  relating  to  (1) 
the  vast  amounts  of  data  continuously  impinging  on  the  nervous  system, 
(2)  the  finite  computational  resources  that  can  be  dedicated  to  any  given 
task,  and  (3)  the  need  for  highly  flexible  linkages  between  a  large  number  of 
physically separate modules. In our previous articulations of this hypothesis
we have focused on the visual system, but we believe that the concepts
and  issues  are  equally  applicable  to  other  systems  as  well.  To  provide 
an  intuitive  illustration  of  this  point,  we  begin  by  considering  two  exam¬ 
ples,  one  relating  to  sensory  processing  and  the  other  to  motor  processing. 
Although they appear superficially rather different, we will argue that these two types of
processing  may  share  important  similarities  in  their  "deep  structure"  at  the 
computational  level  and  at  the  level  of  neurobiological  implementation. 

Visual  Attention 

First,  consider  the  phenomenon  of  directed  visual  attention,  which  has  a 
number  of  characteristics  illustrated  by  the  following  hypothetical  task. 
Suppose  that  an  observer  stares  at  a  fixation  spot  on  an  otherwise  blank 
screen.  A  simple  cuing  stimulus  is  flashed  on  the  screen,  and  it  is  followed 
after  a  brief  interval  by  an  array  of  letters  arranged  concentrically  about 
the  fixation  point.  If  the  cue  is  a  small  spot  occurring  in  the  location  of  the 
leftmost  target  letter,  then  attention  will  be  directed  to  that  location  (figure 
13.1a);  the  observer  will  be  able  to  identify  that  letter  as  a  C  but  will  be 
unable  to  reliably  identify  the  remaining  letters  in  the  array  (cf.  Nakayama 
and  Mackeben  1989;  Krose  and  Julesz  1989).  If  the  cue  is  switched  to  the 
rightmost  letter,  the  observer's  attention  will  likewise  shift  so  as  to  allow 
identification  of  the  letter  T  (figure  13.1b).  A  very  different  result  will  occur, 
though,  if  the  cuing  spot  is  expanded  to  cover  the  entire  target  array  (figure 
13.1c).  In  attending  to  the  overall  pattern,  the  observer  will  immediately 
recognize  the  ring-shaped  geometric  configuration  of  letters,  but  the  spatial 
resolution  within  this  broader  "window  of  attention"  will  be  too  coarse  to 
allow  reliable  identification  of  any  of  the  individual  letters,  let  alone  the 
entire  word  that  they  form  (cf.  Van  Essen  et  al.  1991). 

Our  analysis  of  this  phenomenon  starts  with  the  argument  that  visual 
attention  is  a  process  that  has  evolved  to  subserve  general  purpose  object 
recognition.  The  ability  to  recognize  a  wide  range  of  highly  complex  pat¬ 
terns  (e.g.,  faces)  is  too  computationally  demanding  to  have  the  requisite 
neural  machinery  replicated  separately  for  each  location  in  the  visual  field. 
Instead,  this  machinery  is  relegated  to  a  small  number  of  modules  situated 
at  high  levels  of  the  visual  hierarchy  in  inferotemporal  cortex.  In  our  view, 
visual  attention  is  a  mechanism  for  dynamically  regulating  information 
flow  so  as  to  bring  information  from  the  appropriate  region  of  the  visual 
field  and  in  an  appropriate  format  to  the  appropriate  high-level  object 
recognition  center.  Five  general  characteristics  of  spatially  directed  visual 
attention  warrant  explicit  mention  in  connection  with  this  hypothesis. 

•  Attention  can  readily  be  directed  to  different  locations  and  to  different 
spatial  scales  (Sperling  and  Dosher  1986;  Eriksen  and  Murphy  1987;  Julesz 
1991). 

• Attentional shifts can be initiated by bottom-up cues and/or top-down
influences. When initiated by bottom-up cues (as in the above example),
attentional shifts occur with a finite temporal delay (50-100 msec) and tend
to persist at any given location for a relatively brief period (Nakayama and
Mackeben 1989; Krose and Julesz 1989). In this respect, they are analogous
to saccadic eye movements, except that they occur on a faster time scale.

Figure 13.1 Directed visual attention (a-c) and directed motor control (d-f). The process of
shifting attention to different locations and to different scales (shaded regions) is analogous in
important ways to the process of directing a basic motor routine (e.g., up-down movements
signified by arrows) to different digits or combinations of digits in response to instructions
embodied, say, by the blinking light icons.

•  Attention  is  directed  not  simply  to  the  initial  cue,  but  to  whatever  image 
data  lie  within  the  window  of  attention  once  it  has  been  shifted.  For  exam¬ 
ple,  a  highly  salient  cue  might  be  followed  by  a  very  subtle  spatial  pattern 
whose  characteristics  can  nonetheless  be  scrutinized  via  attentive  process¬ 
ing  (Nakayama  and  Mackeben  1989;  Krose  and  Julesz  1989;  Julesz  1991). 
In  this  respect,  the  cue  evidently  serves  simply  as  a  gating  mechanism  to 
regulate  the  flow  of  image  data. 

• Visual attention acts as an informational bottleneck that reduces to
manageable levels the amount of image data reaching high-level cortical centers
involved in pattern recognition. By our estimate, less than 0.1% of the
information carried in the optic nerve at any given moment passes through
the  attentional  bottleneck  (Van  Essen  et  al.  1991). 

•  For  complex  patterns  to  be  recognized,  it  is  important  that  information 
about  spatial  relationships  be  preserved  within  the  window  of  attention. 
However, because of the narrowness of the information bottleneck, the
spatial resolution within the window is limited to the equivalent of about
30 × 30 resolution elements.

Directed  Motor  Control 

An analogous set of issues arises in certain aspects of motor function, which
we  shall  refer  to  as  "directed  motor  control."  The  essential  notion  is  that  our 
motor  system  is  capable  of  selecting  among  a  large  repertoire  of  stereotyped 
motor  routines,  each  of  which  can  be  distributed  to  different  body  parts 
according  to  specific  instructions  and  cues.  An  instructive  example  of  this 
is  that  of  an  orchestral  conductor.  In  general,  a  given  section  of  music  has 
a  steady  beat  that  must  be  communicated  to  the  orchestra  by  rhythmic 
movements  of  the  baton.  The  conductor  has  a  wide  repertoire  of  specific 
movement  routines  (simple  up-down  strokes,  triangular  strokes,  and  a 
variety  of  more  complex  patterns),  from  which  one  must  be  selected  to 
convey  the  desired  beat.  In  addition,  any  particular  set  of  strokes  can 
be  transmitted  via  movements  of  the  hand,  the  arm,  a  single  finger,  or 
whatever  combination  the  conductor  chooses  from  moment  to  moment. 

To  sharpen  the  analogy  with  directed  visual  attention,  note  that  the  cues 
for  switching  from  one  type  of  movement  to  another  can  be  dictated  by 
bottom-up  sensory  cues  as  well  as  the  top-down  cognitive  control  that 
would  occur  in  the  case  of  the  orchestral  conductor.  For  example,  suppose 
that  an  observer  is  instructed  to  move  an  appendage  up  and  down  at  a 
steady  rate,  in  response  to  a  set  of  lights  situated  adjacent  to  the  fingertips 
(figure 13.1, bottom row). Thus, according to which particular lights were
blinking, the observer would wiggle the index finger (figure 13.1d), the
little finger (figure 13.1e), or all fingers simultaneously (figure 13.1f).

We  contend  that  motor  coordination  is  too  computationally  demanding 
to have the circuitry for de novo generation of each possible motor routine
separately represented for each of the many dozens of appendages
in the body. Instead, we support the notion that there exist only one or a
few central representations of each motor routine (wiggles, circles, figure-8
movements, etc.) that can be generated via top-down (cognitive) processing
(Keele 1981, 1986). Directed motor control is a process for dynamically
regulating  information  flow  so  as  to  distribute  information  about  a  desired 
motor  routine  to  the  appropriate  target  appendage(s).  More  specifically, 
there  are  five  characteristics  of  directed  motor  control  that  share  important 
similarities  with  those  outlined  above  for  visual  attention. 

•  A  given  motor  routine  can  be  distributed  to  target  appendages  at  differ¬ 
ent  locations  and  to  different  spatial  scales  (e.g.,  one  digit  or  many). 

•  Directed  motor  control  can  be  initiated  by  a  variety  of  bottom-up  cues 
and/or  top-down  influences.  A  given  motor  routine  can  be  sustained  more 
or  less  indefinitely  if  desired,  though;  it  does  not  show  the  strong  transient 
component  characteristic  of  some  aspects  of  visual  attention. 

•  The  cue  used  for  initiation  need  not  be  a  direct  replica  of  the  trajectory 
that is to be carried out in any given motor routine. Rather, the cue can
be  an  essentially  arbitrary  abstract  stimulus  that  serves  simply  as  a  gating 
mechanism  to  regulate  the  flow  of  motor  signals. 

•  Directed  motor  control  represents  a  strategy  that  reduces  to  manageable 
levels  the  amount  of  learned  information  that  needs  to  be  stored  in  the 
circuits  involved  in  generating  specific  motor  routines. 

• For a given motor routine to be executed, it is important that the
spatiotemporal sequence of information associated with a given pattern be
preserved  as  it  is  distributed  to  the  appropriate  target  appendages. 

The  common  theme  in  both  of  these  examples  is  that  a  sophisticated 
system  for  dynamically  controlling  information  flow  may  be  essential  for 
the  brain  to  carry  out  its  duties.  The  notion  that  gating  mechanisms  are 
important  in  CNS  function  is  by  no  means  new.  For  example,  gating  has 
long  been  suspected  to  play  a  critical  role  in  the  modulation  of  pain  sen¬ 
sitivity  (Melzack  and  Wall  1965;  Fields  and  Basbaum  1978).  However,  the 
evidence  for  it  in  cortical  function  has  been  fragmentary  and  controver¬ 
sial.  To  guide  further  experimentation,  there  is  a  clear  need  for  detailed, 
neurobiologically  plausible  models.  In  this  chapter,  we  will  review  the 
key  features  of  our  model  of  directed  visual  attention,  emphasizing  recent 
progress  in  making  it  an  autonomous,  self-contained  process.  Then  we 
will  discuss  ways  in  which  this  model  might  be  adapted  to  account  for 
motor  functions.  Finally,  we  will  briefly  consider  the  relevance  of  these 
strategies  for  cognitive  processing. 

A  MODEL  OF  DIRECTED  VISUAL  ATTENTION 

The goal of our model is to provide a neurobiologically plausible mechanism
for shifting and rescaling the representation of an object from the retinal
reference frame into an object-centered reference frame. Information in
the retinal reference frame is represented on a neural map (for instance, the
topographic representation in V1), and we hypothesize that information
in the object-centered reference frame is also represented on a neural map
that preserves some degree of information about local spatial relationships.
This is illustrated in the simplest possible scheme in figure 13.2a, which
shows  a  purely  pixel-based  (but  object-centered)  representation  providing 
the  input  to  a  high-level  associative  memory.  However,  we  do  not  presume 
that  only  "pixels"  are  being  routed  into  the  high  level  areas.  Rather,  each 
sample  node  in  the  high  level  map  may  be  thought  of  as  a  feature  vector 
representing  various  local  image  properties,  such  as  orientation,  texture, 
and depth (figure 13.2b). For simplicity, our current computer model simulates
the routing of only pixel-based data. However, it should be relatively
straightforward to incorporate into future models the additional processing
needed for dynamic routing of more complex feature vectors.

Figure 13.2 Shifting and rescaling the window of attention. The image within the window
of attention in the retina is remapped onto an array of sample nodes in an object-centered
reference frame. (a) In the simplest scheme, each "pixel" in the object-centered reference
frame represents image luminance. (b) More realistically, each pixel should presumably
correspond to a feature vector that integrates over a somewhat larger spatial region and
represents orientation, depth, texture, etc.

To topographically map an arbitrary section of the input (retina) onto
the output (object-centered reference frame), each neuron in the output
stage needs to have dynamic access to a large number of neurons in the
input stage. In the brain, this access must necessarily be obtained via
the physical hardware of axons and dendrites. Since these pathways are
physically fixed for the time scale of interest to us (< 1 sec), there needs
to be a way of dynamically modifying their strengths. We propose that
the efficacy of transmission of these pathways is modulated by the activity
of control neurons whose primary responsibility is to dynamically route
information through successive stages of the cortical hierarchy.

A  Simple  Dynamic  Routing  Circuit 

Figure 13.3a illustrates a simple, 1D dynamic routing circuit. Each node
in  the  circuit  forms  a  linear  weighted  sum  of  its  inputs,  and  the  weights 
are  dynamically  modified  by  a  set  of  control  neurons  to  set  the  position 
and  scale  of  the  window  of  attention.  The  hierarchical  connection  scheme 
shown  has  the  attractive  property  of  keeping  the  fan-in  (number  of  inputs) 
on  any  node  fixed  to  a  relatively  low  number  while  allowing  the  nodes  in 
the  output  layer  access  to  any  part  of  the  input  layer.  This  property  will  be 
important in scaling up the model.
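
The fan-in property can be made concrete with a small counting sketch. The exact wiring of figure 13.3a is not reproduced here; as a labeled assumption, each node takes `fan_in` equally spaced inputs from the layer below, and the spacing between those inputs doubles at each successive stage. The function name and the numbers are hypothetical.

```python
def input_reach(fan_in, spacings):
    """Number of contiguous input nodes reachable from one output node.

    `spacings` lists the gap between a node's inputs at each stage,
    ordered from the stage nearest the input layer upward. Coverage is
    contiguous as long as each spacing does not exceed the reach of the
    layer below it.
    """
    reach = 1  # a node in the input layer reaches only itself
    for spacing in spacings:
        reach += (fan_in - 1) * spacing
    return reach

# Three stages, each node with a fan-in of only 3 (assumed values).
reach = input_reach(fan_in=3, spacings=[1, 2, 4])
print(reach)  # a single-stage circuit would need a fan-in this large
```

Under these assumptions the reach grows roughly geometrically with the number of stages while the fan-in per node stays fixed, which is the scaling advantage the text describes.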

Figure 13.3 A simple, one-dimensional dynamic routing circuit. (a) Connections are shown
for the leftmost node in each layer. The connections for the other nodes are the same, but
merely shifted. N denotes the number of nodes within each layer. A set of control units (not
explicitly shown) provides the necessary signals for modulating connection strengths so that
the image within the window of attention in the input is mapped onto the output nodes.
(b, c) Some examples of how the weights would be set for different positions and sizes of the
window of attention. The gray-level of each connection denotes its strength. Essentially, the
sets of weights feeding into each node are samples of a Gaussian interpolation function that
is shifted and rescaled according to how the window of attention is shifted and rescaled.

An example of how the weights might be set for different positions and
sizes of the window of attention is shown in figure 13.3b,c. When the
window is at its smallest size (i.e., at the same resolution as the input
stage, figure 13.3b), the weights are set so as to establish a one-to-one
correspondence between nodes in the output and the attended nodes in
the input. When the window is larger, the weights must be set so that
multiple inputs converge onto a single output node, resulting in a lower-
resolution representation of the contents of the window of attention on the
output nodes. One can see from this illustration, then, that the challenge in
controlling the routing circuit lies in properly setting the weights to yield
the desired position and size of the window of attention. Note that in
general there are many possible solutions in terms of the combinations of
weights that could achieve any particular input-output transformation.
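
One possible weight setting of the kind described can be sketched for a two-layer, one-dimensional case. The sketch samples a Gaussian interpolation function whose center and width follow the position and scale of the window; the function names, the width parameter `sigma_per_scale`, and the row normalization are assumptions made for illustration and are only one of the many solutions the text mentions.

```python
import math

def routing_weights(n_in, n_out, center, scale, sigma_per_scale=0.3):
    """Rows of Gaussian weights mapping a window of the input onto the output.

    `center` is the input position at the middle of the window and `scale`
    is the number of input nodes spanned per output node (assumed names).
    """
    sigma = sigma_per_scale * scale  # the Gaussian widens with the window
    weights = []
    for i in range(n_out):
        # Input position that output node i should sample.
        target = center + scale * (i - (n_out - 1) / 2)
        row = [math.exp(-((j - target) ** 2) / (2 * sigma ** 2))
               for j in range(n_in)]
        total = sum(row)
        weights.append([w / total for w in row])  # normalize each row
    return weights

def route(weights, signal):
    """Linear weighted sum of the inputs for every output node."""
    return [sum(w * x for w, x in zip(row, signal)) for row in weights]

# Small window: close to a one-to-one mapping of inputs 3, 4, 5.
narrow = routing_weights(n_in=9, n_out=3, center=4, scale=1)
# Large window: several inputs converge on each output node.
wide = routing_weights(n_in=9, n_out=3, center=4, scale=3)

signal = [0, 0, 0, 0, 0, 1, 0, 0, 0]
print(route(narrow, signal))
print(route(wide, signal))
```

With the small window nearly all of each output node's weight falls on a single input, whereas with the large window the weight is spread over neighboring inputs, giving the lower-resolution representation described above.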

Control 

Since the purpose of attention is to focus the neural resources for recognition
on a specific region within a scene, it would make sense for the
attentional  window  to  be  automatically  guided  to  salient,  or  potentially 
informative  areas  of  the  visual  input.  Salient  areas  can  often  be  defined 
on  the  basis  of  relatively  low-level  cues  —  such  as  pop-out  due  to  motion, 
depth,  texture,  or  color  (e.g.,  Koch  and  Ullman  1985).  Here,  we  utilize 
a  very  simple  measure  of  salience  based  on  luminance  pop-out  in  which 
attention  is  attracted  to  "blobs"  in  a  low-pass  filtered  version  of  a  scene.  (A 
blob  may  be  defined  simply  as  a  contiguous  cluster  of  activity  within  an  im¬ 
age.)  Attention  can  also  be  directed  via  voluntary  or  cognitive  (top-down) 
influences,  but  these  are  not  incorporated  into  our  current  model. 

We  propose  the  following  simple  but  useful  strategy  for  an  autonomous 
visual  system  (see  figure  13.4): 

1.  Form  a  low-pass  filtered  version  of  the  scene  so  that  objects  are  blurred 
into  blobs. 

2.  Select  one  of  the  blobs  from  the  low-pass  image — whichever  is  brightest 
or  largest  —  and  set  the  position  and  size  of  the  window  of  attention  to 
match  the  position  and  size  of  the  blob. 

3.  Feed  the  high-resolution  contents  of  the  window  of  attention  to  an  as¬ 
sociative  memory  for  recognition. 

4.  If  a  match  with  one  of  the  memories  is  close  enough  (by  some  as  yet 
unspecified  criterion),  then  consider  the  object  to  have  been  recognized; 
note  its  identity,  location,  and  size  in  the  scene.  If  there  is  not  a  good  match, 
then  consider  the  object  to  be  unknown;  either  learn  it  or  disregard  it. 

5.  Now  inhibit  this  part  of  the  scene  and  go  to  step  2  (find  the  next  most 
salient  blob). 

The  next  three  sections  describe  details  for  carrying  out  steps  2, 3,  and  5. 
Step  1  is  trivial,  whereas  step  4  is  a  high-level  problem  beyond  the  scope 
of  this  paper  (see  chapter  7  by  Mumford). 
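
The five-step strategy listed above can be sketched as a loop over a one-dimensional toy scene. Everything here is a stand-in chosen for illustration: a moving average plays the role of the low-pass filter, a dictionary lookup plays the role of the associative memory, and the threshold `floor` and all function names are hypothetical rather than part of the model.

```python
def low_pass(scene, radius=1):
    """Step 1: blur the scene with a moving average (assumed filter)."""
    n = len(scene)
    blurred = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        blurred.append(sum(scene[lo:hi]) / (hi - lo))
    return blurred

def find_blob(blurred, inhibited, floor=0.1):
    """Step 2: extent (start, end) of the brightest uninhibited blob."""
    candidates = [i for i in range(len(blurred))
                  if not inhibited[i] and blurred[i] > floor]
    if not candidates:
        return None
    peak = max(candidates, key=lambda i: blurred[i])
    start = end = peak
    while start > 0 and blurred[start - 1] > floor and not inhibited[start - 1]:
        start -= 1
    while (end < len(blurred) - 1 and blurred[end + 1] > floor
           and not inhibited[end + 1]):
        end += 1
    return start, end

def attend(scene, recognize):
    """Run the loop until no salient blob remains."""
    blurred = low_pass(scene)
    inhibited = [False] * len(scene)
    found = []
    while True:
        blob = find_blob(blurred, inhibited)
        if blob is None:
            return found
        start, end = blob
        contents = scene[start:end + 1]          # step 3: full-resolution window
        found.append((recognize(contents), start, end - start + 1))  # step 4
        for i in range(start, end + 1):          # step 5: inhibit this region
            inhibited[i] = True

# Toy associative memory keyed by the nonzero contents of the window.
memory = {(3,): "spike", (1, 1): "bar"}

def recognize(contents):
    return memory.get(tuple(x for x in contents if x), "unknown")

scene = [0, 0, 1, 1, 0, 0, 0, 0, 3, 0, 0]
found = attend(scene, recognize)
print(found)
```

Each entry of `found` records the identity, location, and size noted in step 4; the size reflects the blurred blob, so it is slightly larger than the object itself.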

Figure 13.4 A simple attentional strategy for recognizing objects in a scene.
Panels: (1) blur objects into blobs; (2) focus the window of attention on a blob;
(3) feed the high-resolution contents of the window of attention to an associative
memory; (4) note the location, size, and identity of the object; (5) move on to the
next blob and repeat. Objects are preattentively segmented via low-pass filtering.
Once an object has been localized, the contents of the window of attention are fed
to an associative memory for recognition. This process is then repeated ad
infinitum.

Focusing the Window of Attention on a Blob To formulate a solution for
controlling the routing circuit, we will simplify matters even further and
consider controlling a simple two-layer, one-dimensional network consisting
of an input layer and an output layer only (figure 13.5). The output
units, $I^{out}$, compute their activation from the input units, $I^{in}$, via a simple
linear summation

$$I_i^{out} = \sum_j w_{ij} I_j^{in} \tag{13.1}$$

and the weights are dynamically modified by a set of control neurons, $c$,
via

$$w_{ij} = \sum_k c_k \Gamma_{ijk} \tag{13.2}$$

where $\Gamma_{ijk}$ denotes the weight with which control neuron $c_k$ modulates the
strength of synapse $(i, j)$ between neuron $j$ in the input layer and neuron $i$ in
the output layer. For ease of notation, this equation describes the general
case in which each control neuron may modulate any or all synapses $(i, j)$. In
fact, each control neuron actually modulates only a local group of synapses,
and so $\Gamma_{ijk}$ will be nonzero for only a few combinations of $i$, $j$, and $k$. In the
particular circuit of figure 13.5, the $\Gamma_{ijk}$ are set simply so that each control
unit $c_k$ corresponds to a global position of the window of attention, but in
general this need not be the case.

Figure 13.5 A very simple one-dimensional routing circuit with a Gaussian blob presented
to the input units. Each control unit corresponds to a different position of the window of
attention: left ($c_0$), center ($c_1$), or right ($c_2$). For example, to accomplish the remapping shown,
the values on the control units should be $c_2 = 1$ and $c_0 = c_1 = 0$. The circuitry for autonomous
control is shown on the left, with each control unit receiving input from a Gaussian receptive
field in the input layer; the control units then compete among each other via negatively
weighted interconnections, so that only the control unit corresponding to the strongest blob
in the input prevails.
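
Equations (13.1) and (13.2) can be illustrated numerically. The sketch below assumes a five-unit input layer, a three-unit output layer, and three control units that each select one global shift of the window; these sizes and the helper names `make_gamma` and `route` are assumptions made for illustration, not values taken from figure 13.5.

```python
import numpy as np

def make_gamma(n_in, n_out):
    """Gamma[i, j, k] = 1 when control unit k connects input j = i + k to output i."""
    n_ctrl = n_in - n_out + 1
    gamma = np.zeros((n_out, n_in, n_ctrl))
    for k in range(n_ctrl):
        for i in range(n_out):
            gamma[i, i + k, k] = 1.0
    return gamma

def route(i_in, c, gamma):
    """Apply eq. (13.2) to build the weights, then eq. (13.1) to compute the outputs."""
    w = np.einsum("k,ijk->ij", c, gamma)   # w_ij = sum_k c_k Gamma_ijk
    return w @ i_in                        # I_out_i = sum_j w_ij I_in_j

gamma = make_gamma(5, 3)
i_in = np.array([0.0, 0.0, 0.5, 1.0, 0.5])   # blob on the right of the input
c = np.array([0.0, 0.0, 1.0])                # c_2 = 1, c_0 = c_1 = 0
i_out = route(i_in, c, gamma)                # the blob now fills the output units
```

Setting a different control unit to 1 shifts the window, so the same input is remapped to the output units from a different position.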

To  focus  the  window  of  attention  on  a  blob  in  the  input,  the  network's 
"goal"  will  be  to  fill  the  output  units  with  a  blob  while  maintaining  a  topo¬ 
graphic  correspondence  between  the  input  and  output  nodes  (figure  13.4, 
step  2).  Since  the  dynamic  variables  in  this  network  are  the  ck,  we  need  to 
formulate  a  dynamic  equation  for  the  ck  that  accomplishes  this  objective. 
We can achieve this by letting $c_k$ follow the gradient of an objective
function, $E_{blob}$, that provides a measure of how well a blob is focused on the
output units:

$$E_{blob} = -\sum_i I_i^{out} G_i \tag{13.3}$$

where $G$ denotes the desired blob shape (e.g., in the circuit of figure 13.5,
$G$ is a Gaussian). The second part of the objective (maintaining topography)
may be accomplished by letting $c_k$ follow the gradient of a constraint
function that favors control states corresponding to translations or scalings of
the input-output transformation:

$$E_{constraint} = -\frac{1}{2} \sum_{k,l} c_k T^c_{kl} c_l \tag{13.4}$$

where the constraint matrix $T^c$ is chosen so as to appropriately couple the
control neurons. For example, in the particular circuit of figure 13.5, $T^c$ is
set to accomplish a winner-take-all function ($T^c_{kl} = -1$, $k \neq l$) since each $c_k$
corresponds to a global position of the window of attention.

A dynamic equation for $c_k$ that simultaneously minimizes both of these
objective functions ($E_{blob}$ and $E_{constraint}$) is given by

$$c_k = \sigma(u_k) \tag{13.5}$$

$$\frac{du_k}{dt} + \alpha u_k = \eta \sum_i \sum_j G_i \Gamma_{ijk} I_j^{in} + \beta \sum_l T^c_{kl} c_l \tag{13.6}$$

where $\alpha$, $\eta$, and $\beta$ are constants that determine the rate of convergence of
the system, and $\sigma$ is a sigmoidal squashing function. The neural circuitry
required for computing equations (13.5) and (13.6) is shown in figure 13.5.
The  first  term  on  the  right  of  equation  (13.6)  is  computed  by  correlating  the 
Gaussian,  G,  with  a  shifted  version  of  the  input  (the  amount  of  shift  de¬ 
pends  on  the  index  k ).  The  second  term  is  computed  by  forming  a  weighted 
sum  of  the  activities  on  the  other  control  units.  These  two  results  are  then 
summed  together  and  passed  through  a  leaky  integrator  and  squashing 
function  to  form  the  output  of  the  control  unit,  ck.  Thus,  the  ck  essentially 
derive  their  inputs  directly  from  a  "blob  map,"  and  then  compete  among 
each  other  so  that  the  ck  corresponding  to  the  strongest  blob  prevails. 
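
A minimal numerical sketch of equations (13.5) and (13.6) follows, integrated with forward Euler steps. The logistic form of the squashing function, the layer sizes, the parameter values (alpha = 1, eta = 2, beta = 8), the step size, and the names `make_gamma`, `sigmoid`, and `settle_control` are all assumptions chosen for illustration; the chapter does not specify them.

```python
import numpy as np

def make_gamma(n_in, n_out):
    """Gamma[i, j, k] = 1 when control unit k connects input j = i + k to output i."""
    n_ctrl = n_in - n_out + 1
    gamma = np.zeros((n_out, n_in, n_ctrl))
    for k in range(n_ctrl):
        for i in range(n_out):
            gamma[i, i + k, k] = 1.0
    return gamma

def sigmoid(u):
    """Logistic squashing function (assumed form of sigma)."""
    return 1.0 / (1.0 + np.exp(-u))

def settle_control(i_in, gamma, g, alpha=1.0, eta=2.0, beta=8.0, dt=0.05, steps=400):
    n_ctrl = gamma.shape[2]
    t_c = -(np.ones((n_ctrl, n_ctrl)) - np.eye(n_ctrl))   # T^c_kl = -1 for k != l
    # Bottom-up term: correlate the Gaussian G with each shifted view of the input.
    drive = np.einsum("i,ijk,j->k", g, gamma, i_in)
    u = np.zeros(n_ctrl)
    for _ in range(steps):
        c = sigmoid(u)                                    # eq. (13.5)
        du = -alpha * u + eta * drive + beta * (t_c @ c)  # eq. (13.6)
        u = u + dt * du
    return sigmoid(u)

gamma = make_gamma(5, 3)
g = np.exp(-0.5 * (np.arange(3) - 1.0) ** 2)   # Gaussian blob template on the outputs
i_in = np.array([0.0, 0.0, 0.5, 1.0, 0.5])     # blob centered under the right window
c = settle_control(i_in, gamma, g)
```

With these settings the control unit for the rightmost window position ends up near 1 while the other two are driven toward 0, which is the winner-take-all behavior described in the text.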

This  circuit  could  easily  be  modified  to  allow  for  different  sizes  of  the 
window  of  attention  by  adding  another  set  of  control  units  for  each  desired 
size  of  the  window  of  attention.  For  example,  a  control  unit  corresponding 
to a large window of attention would be connected to the weights, $w_{ij}$,
so  as  to  converge  multiple  inputs  into  a  single  output  node,  as  in  figure 
13.3c.  Such  a  control  unit  would  then  have  a  larger  Gaussian  receptive 
field  in  the  input  (or,  correspondingly,  it  would  receive  its  input  from  a 
coarser-grained  blob  map).  The  control  units  for  each  different  size  and 
position  would  then  compete  among  each  other  to  constrain  the  attentional 
window  to  be  of  a  single  size  and  position. 

Note that since equations (13.5) and (13.6) are nonlinear in $c_k$, there exists
the  potential  for  getting  stuck  in  local  minima.  This  is  not  a  serious  problem 
for  the  circuit  of  figure  13.5,  however,  because  the  control  neurons  are  so 
tightly  constrained  due  to  their  global  connectivity  (each  control  neuron 
corresponds  to  a  global  position  of  the  window  of  attention,  so  the  winner- 
take-all  circuit  ensures  a  single,  affine  transformation).  On  the  other  hand, 
in  larger  circuits  where  the  control  neurons  must  be  connected  to  local 
groups  of  synapses  instead  of  globally,  the  existence  of  local  minima  will 
present a significant problem. This problem can be overcome by utilizing
a coarse-to-fine control architecture, in which routing is at first performed
by  a  small  number  of  control  neurons  on  a  low-pass  filtered  version  of  the 
image.  This  smaller  set  of  control  neurons  can  then  be  used  to  constrain 
the  activities  of  the  fine-grained,  locally  connected  control  neurons  that  are 
routing  the  high-resolution  information. 


Recognition  Once  the  window  of  attention  has  focused  on  a  blob,  the 
underlying  high-resolution  information  can  also  be  fed  through  the  rout¬ 
ing  circuit  and  into  the  input  of  an  associative  memory  for  recognition. 
However,  it  is  likely  that  the  initial  estimation  of  position  and  size  made 
by  routing  the  blob  was  only  approximately  correct,  and  this  may  cause 
problems  for  matching  the  high-resolution  information.  Thus,  it  would 
be  desirable  to  have  the  associative  memory  help  adjust  the  position  and 
scale  of  the  attentional  window  while  it  converges.  How,  then,  shall  the 
associative  memory  be  incorporated  into  the  control  of  the  routing  circuit? 

If a Hopfield associative memory (Hopfield 1984) is used for recognition,
we can replace the blob search objective function, $E_{blob}$, with the associative
memory's objective function, $E_{mem}$. Normally, the only dynamic variables
in a Hopfield associative memory are the output voltages, $V_i$, which evolve
by simply following a monotonically increasing function of the gradient
of the energy

$$V_i = g(u_i) \tag{13.7}$$

$$C_i \frac{du_i}{dt} = -\frac{\partial E_{mem}}{\partial V_i} = \sum_j T_{ij} V_j - \frac{u_i}{R_i} + I_i^{mem} \tag{13.8}$$


where $T_{ij}$ denotes the connection strength between neurons $i$ and $j$, $C_i$ and
$R_i$ are constants that determine the integration time constant of each neuron,
and $g$ is a squashing function (usually tanh). Since the inputs of the
associative memory, $I_i^{mem}$, are to be obtained directly from the outputs of
the routing circuit, the $c_k$ now become additional dynamic variables
incorporated into the associative memory's energy function. By letting the $c_k$
follow the gradient of the energy, along with the $V_i$, the combined
associative memory/routing circuit should relax to the closest stored pattern
and to the correct position and size of the window of attention
simultaneously. A dynamic equation for $c_k$ that simultaneously minimizes both the
associative memory's energy and the constraint function for preserving
topography is given by

$$c_k = \sigma(u_k) \tag{13.9}$$

$$\frac{du_k}{dt} + \alpha u_k = \eta \sum_i \sum_j V_i \Gamma_{ijk} I_j^{in} + \beta \sum_l T^c_{kl} c_l \tag{13.10}$$


A neural circuit for computing equations (13.9) and (13.10) is shown in
figure 13.6. The first term on the right-hand side of equation (13.10) is
computed by correlating the inputs, $I^{in}$, and memory outputs, $V_i$, whose
connection pathways are influenced by control unit $c_k$ (specified by $\Gamma_{ijk}$).
The other terms are computed as before. Thus, the main qualitative
difference between this circuit and the "blob finder" (figure 13.5) is that the
control is guided by the interaction between top-down and bottom-up
signals rather than purely bottom-up sources.

Figure 13.6 An autonomous routing circuit for recognition. Each node of the associative
memory receives its external input from an output node of the routing circuit. Hence, each
node of the associative memory has dynamic connections to many input nodes. The outputs
of the associative memory are then fed back and correlated with the inputs to drive the control
units.
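
The top-down/bottom-up correlation that drives each control unit in recognition mode can be sketched as below. The layer sizes, the example memory output `v`, the input pattern, and the names `make_gamma` and `control_drive` are assumptions for illustration; the leak, the constraint term, and the memory dynamics are omitted.

```python
import numpy as np

def make_gamma(n_in, n_out):
    """Gamma[i, j, k] = 1 when control unit k connects input j = i + k to output i."""
    n_ctrl = n_in - n_out + 1
    gamma = np.zeros((n_out, n_in, n_ctrl))
    for k in range(n_ctrl):
        for i in range(n_out):
            gamma[i, i + k, k] = 1.0
    return gamma

def control_drive(v, i_in, gamma):
    """Correlation term: sum_i sum_j V_i Gamma_ijk I_in_j for every control unit k."""
    return np.einsum("i,ijk,j->k", v, gamma, i_in)

gamma = make_gamma(7, 4)
v = np.array([1.0, -1.0, 1.0, 1.0])                     # memory outputs (tanh-like, in [-1, 1])
i_in = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0])   # same pattern, offset by two units
drive = control_drive(v, i_in, gamma)
best_shift = int(np.argmax(drive))                      # window position favored by the memory
```

The control unit whose window aligns the input with the pattern currently expressed by the memory receives the largest drive, which is how the memory can help refine the position of the window.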

Again, since equations (13.9) and (13.10) are nonlinear in $c_k$, the
potential  exists  for  the  routing  circuit  to  get  stuck  in  local  minima.  This 
problem  can  be  overcome  in  a  similar  manner  as  outlined  previously  by 
having  the  associative  memory  and  routing  circuit  work  at  varying  levels  of 
resolution.  Matching  would  first  be  performed  at  low  resolution,  and  this 
information  would  then  be  used  to  constrain  the  matching  at  progressively 
higher  resolutions. 

Shifting  Attention  to  the  Next  Object  Once  an  object  has  been  recog¬ 
nized,  the  window  of  attention  should  move  on  to  another  interesting  part 
of  the  scene.  One  way  this  could  be  accomplished  would  be  for  the  control 
units  to  be  self-inhibited  through  a  delay.  Thus,  when  a  group  of  control 
units  is  active  for  some  time  (long  enough  for  recognition  to  take  place)  it 
should begin to shut off. This would allow other blobs or interesting items
to compete successfully for control of the window of attention (see also
Koch  and  Ullman  1985). 

Computer  Simulation  Figure  13.7  shows  the  results  of  a  computer  sim¬ 
ulation  of  a  simple  attentional  system  for  recognizing  objects,  based  on  the 
principles  elucidated  above.  The  network  begins  in  blob  search  mode,  at¬ 
tempting  to  fill  the  output  of  the  routing  circuit  with  something  interesting. 
In  figure  13.7a  the  network  has  settled  on  the  A,  since  it  has  the  greatest 
overall  brightness  in  the  input.  (Since  the  shapes  used  in  this  example 
are  so  compact  and  simple  we  have  bypassed  the  step  of  prefiltering  them 
into  blobs.  Thus,  during  blob  search,  an  object  is  low-pass  filtered  by  the 
routing  circuit  itself.)  After  settling  on  a  potentially  interesting  object,  the 
network  is  switched  into  recognition  mode  and  the  output  of  the  routing 
circuit  is  fed  to  an  associative  memory.  Two  patterns — A  and  C — have 
been  previously  stored  in  the  associative  memory.  The  blurred  version  of 
the  object  initially  drives  the  inputs  of  the  associative  memory  to  begin 
the  pattern  search.  If  the  position  of  the  window  of  attention  is  slightly 
off,  the  blurred  version  of  the  object  is  not  affected  much  and  still  sends 
the  memory  searching  in  the  correct  direction.  As  the  associative  memory 
converges,  control  units  compute  the  correlation  between  outputs  and  in¬ 
puts  and  set  their  activation  correspondingly.  This  tends  to  maximize  the 
similarity  between  the  outputs  of  the  memory  and  the  outputs  of  the  rout¬ 
ing  circuit,  which  will  also  refine  the  position  of  the  attentional  window 
so  that  the  high-resolution  components  can  be  properly  matched  (figure 
13.7b).  After  allowing  a  fixed  amount  of  time  for  the  associative  memory  to 
converge  (another  time  constant  or  two),  the  simulation  states  the  position 
and  presumed  identity  of  the  object.  The  current  control  state  is  then  self- 
inhibited  and  the  network  switches  back  into  blob  search  mode.  This  then 
puts  the  next  interesting  object  at  a  competitive  advantage  in  attracting  the 
window  of  attention  so  that  it  may  also  be  recognized  (figure  13.7c, d). 

This  simulation  demonstrates  the  operation  of  simple,  neural-like  cir¬ 
cuits  for  routing  and  control  in  both  "preattentive"  (blob  search)  and  "at¬ 
tentive"  (recognition)  modes.  Although  we  have  greatly  oversimplified 
matters  for  the  purpose  of  explanation,  the  same  basic  principles  can  be 
extended  to  larger,  hierarchical  circuits.  We  now  turn  to  the  issue  of  how 
such  circuits  may  possibly  be  implemented  in  the  brain. 

Figure 13.7 Computer simulation of a simple attentional system for recognizing objects.
The input to the routing circuit consists of a 22 x 22 pixel array of sample nodes and the
output of the routing circuit is an 8 x 8 array of sample nodes. There are three sets of control
units, each one corresponding to a different size of the window of attention (small [8 x 8],
medium [11 x 11], and large [16 x 16]). Each control neuron within a set corresponds to a
particular position of the window of attention. The Hopfield associative memory network
("Mem output," see figure 13.6) is composed of 64 units, fully interconnected and arranged
into an 8 x 8 grid (i.e., one node for each output of the routing circuit). The dashed outline
denotes the position and size of the window of attention. (a) In blob search mode, the network
settles on the A, since it has the greatest overall brightness. (b) The network is then switched
into recognition mode and settles on the identification of the object. The position and size of
the object are encoded in the activities of the control neurons. After a fixed amount of time,
the current control state is self-inhibited and the network is switched back into blob search
mode so that the next most interesting object may be recognized (c, d).

Neurobiological Substrates and Mechanisms

Figure 13.8a illustrates the major visual processing pathways of the primate
brain. Information from the retinogeniculostriate pathway enters the visual
cortex through area V1 in the occipital lobe and proceeds through a
hierarchy of visual areas that can be subdivided into two major functional
streams (Ungerleider and Mishkin 1982). The so-called "form" pathway
leads ventrally through V4 and inferotemporal cortex (IT) and is mainly
concerned with object identification, regardless of position or size. The
so-called "where" pathway leads dorsally into the posterior parietal complex
(PP), and seems to be concerned with the locations and spatial relationships
among objects, regardless of their identity. The pulvinar, a subcortical
nucleus of the thalamus, makes reciprocal connections with all of these
cortical areas (cf. Robinson and Petersen 1992). The following subsections
describe how we envision the dynamic routing circuit mapping onto this
collection of neural hardware.

Cortical Areas Figure 13.8b shows the scaled-up routing circuit that we
propose as a model of attentional processing in visual cortex. The different
stages of the network correspond to the major cortical areas in the "form"
pathway. There are two stages for V1: V1a corresponding to layer 4C, and
V1b corresponding to superficial layers, because V1 has about twice the
density of neurons per unit surface area as the rest of neocortex (O'Kusky
and Colonnier 1982). The remaining areas, V2, V4, and inferotemporal
cortex (IT), occupy one stage apiece. Each node represents, in the simplest
sense, a sample of image luminance. More realistically, each node would
correspond to a feature vector that is represented by the activity profile
of a large group (hundreds or thousands) of neurons in each visual area.
For example, in V1, each group would include cells selective for various
orientations, spatial frequencies, etc. in a small region of visual space. It
is impractical at this stage to include these characteristics explicitly in our
model, but we contend that these details can be safely neglected for now
without losing the predictive value of the model.

Figure 13.8 Neurobiological substrates of the model. (a) The major visual processing
pathways of the primate brain. Some connection pathways (e.g., V4-PP) are not shown to avoid
clutter. (b) Proposed neuroanatomical substrates for dynamic routing. The label beside each
layer indicates the corresponding cortical area and the number of sample nodes in one
dimension. The number of sample nodes in two dimensions is approximately the square of this
number. (Of course, there would be many neurons at each node in any model that represented
complex features rather than pixels.) At the bottom is shown a scale of the approximate
eccentricity of the input nodes to the circuit. Connections are shown for the middle node in
each layer. (Individual nodes are indistinguishable here because of their density.) Control
signals originate from the pulvinar to effectively gate the feedforward synapses.

The  size  of  each  stage  scales  roughly  with  the  relative  size  of  each  corre¬ 
sponding  cortical  area.  (A  notable  exception,  of  course,  is  the  IT  complex, 
due  to  the  fact  that  much  of  the  neural  resources  in  IT  are  probably  de¬ 
voted  to  recognition  rather  than  representing  the  contents  of  the  window 
of  attention  itself.)  In  addition,  the  fan-in  per  node  (~1000:1)  and  resulting 
receptive  field  sizes  are  consistent  with  neuroanatomical  and  neurophys¬ 
iological  data  (Gattass  et  al.  1985;  Cherniak  1990;  Douglas  and  Martin 
1990).  The  number  of  nodes  in  the  output  (~  30  nodes  in  diameter)  corre¬ 
sponds  to  the  spatial  resolution  of  the  window  of  attention  in  our  model, 
which  seems  to  be  consistent  with  several  lines  of  psychophysical  evidence 
(Campbell  1985;  Van  Essen  et  al.  1991). 

A  number  of  physiological  experiments  indicate  that  the  posterior  pari¬ 
etal  complex  (PP)  may  be  representing  the  locations  of  potential  attentional 
targets  in  the  visual  scene  (Mountcastle  et  al.  1981;  Robinson  et  al.  1991; 
Steinmetz et al. 1992). For this reason, we propose that PP may act as a
"saliency map" (e.g., Koch and Ullman 1985), analogous to the blob map
utilized in the simple attentional system described previously. The superior
colliculus may also supplement PP in this role by acting as a crude
saliency  map,  but  with  a  quicker  response  time  due  to  its  direct  retinal 
input.  These  neurons  would  then  drive  the  control  neurons  that  compete 
for  control  of  the  window  of  attention. 

Subcortical  Areas  We  hypothesize  that  the  pulvinar  plays  an  important 
role  in  providing  the  control  signals  required  for  the  routing  circuit  (see 
also  chapter  5  by  Koch  and  Crick).  The  pulvinar  is  reciprocally  connected 
to  all  areas  in  the  form  pathway,  thus  making  it  a  plausible  candidate  for 
modulating information flow from V1 to IT. In addition, the pulvinar receives
projections from both PP and superior colliculus, which are known
to  encode  the  direction  of  saccade  targets  and  may  also  be  involved  in 
setting  up  attentional  targets  (Posner  and  Petersen  1990).  Finally,  neuro¬ 
physiological  studies  (Petersen  et  al.  1985),  lesion  studies  (Desimone  et 
al.  1990),  and  PET  studies  (LaBerge  and  Buchsbaum  1990)  of  the  pulvinar 
suggest  that  it  plays  an  important  role  in  visual  attention. 

A  subcortical  nucleus  such  as  the  pulvinar  also  has  the  important  prop¬ 
erty  of  being  spatially  localized  while  at  the  same  time  being  able  to  com¬ 
municate  with  vast  areas  of  the  visual  cortex.  The  relative  proximity  of 
pulvinar  neurons  to  each  other  would  facilitate  the  competitive  and  coop¬ 
erative  interactions  among  the  control  neurons  that  are  necessary  to  en¬ 
force  the  constraint  of  having  a  single  window  of  attention.  Intrapulvinar 
communication could possibly be subserved by interneurons within the
pulvinar  (Ogren  and  Hendrickson  1979)  or  through  the  reticular  nucleus 
of  the  thalamus  (Conley  and  Diamond  1990). 

Although it is difficult to estimate exactly how many control neurons
would be required for a cortical routing circuit, we estimate that roughly
$10^6$ neurons would suffice, which is in the range of the total number of
neurons we estimate for the pulvinar (see Olshausen et al. 1993).

Gating  Mechanisms  Neural  gating  mechanisms  are  believed  to  play  an 
important  role  in  many  aspects  of  nervous  system  function.  For  example, 
the  extent  to  which  a  noxious  stimulus  is  perceived  as  painful  varies  greatly 
as  a  function  of  one's  emotional  state  and  other  external  factors.  This  is 
subserved,  at  least  in  part,  by  gating  mechanisms  in  the  spinal  cord,  where 
descending  fibers  from  the  raphe  nuclei  form  part  of  a  control  system  that 
modulates  pain  transmission  via  presynaptic  inhibition  in  the  dorsal  horn 
(Fields  and  Basbaum  1978).  Gating  mechanisms  are  also  thought  to  play  an 
important  role  in  sensorimotor  coordination;  for  example,  central  pattern 
generators  in  the  spinal  cord  are  known  to  gate  sensory  inputs  according 
to  the  phase  of  the  movement  cycle  in  which  the  input  occurs  (Sillar  1991). 
A  somewhat  different  form  of  gating  seems  to  take  place  in  the  LGN,  in 
which  thalamic  relay  cells  seem  to  exhibit  two  distinct  response  modes:  a 
relay  mode,  in  which  cells  tend  to  more  or  less  faithfully  replicate  retinal 
input,  and  a  nonrelay  burst  mode,  in  which  cells  burst  in  a  rhythmic  pattern 
that  bears  little  resemblance  to  the  retinal  input  (Sherman  and  Koch  1986). 
In  this  instance,  the  reticular  nucleus  of  the  thalamus  is  thought  to  be  the 
source  of  the  signal  that  switches  the  LGN  into  the  nonrelay  burst  mode. 

Although  there  is  as  yet  no  explicit  evidence  for  gating  mechanisms 
in  the  visual  cortex,  there  are  several  possible  biophysical  mechanisms 
that would allow control neurons to gate synapses along the V1-IT pathway.
Presynaptic inhibition, as in the spinal cord, would provide the most
localized  gating  effect.  However,  to  date  there  exists  no  morphological 
evidence  for  this  type  of  mechanism  (Berman  et  al.  1992).  Postsynapti- 
cally,  a  control  neuron  could  decrease  or  possibly  nullify  the  efficacy  of 
a  corticocortical  synapse  via  shunting  inhibition.  Evidence  for  this  type 
of  mechanism  playing  a  role  in  orientation  or  direction  tuning  is  mixed, 
with  some  for  (Volgushev  et  al.  1992;  Pei  et  al.  1992)  and  some  against 
(Douglas  et  al.  1988).  Another  possible  postsynaptic  gating  mechanism 
could  be  realized  via  the  combined  voltage-  and  ligand-gated  NMDA  re¬ 
ceptor  channel,  which  has  been  shown  to  play  an  important  role  in  normal 
visual  function  (Nelson  and  Sur  1992;  Miller  et  al.  1989).  In  this  case,  a 
neuron  could  effectively  boost  the  gain  of  a  corticocortical  synapse  by  lo¬ 
cally  depolarizing  the  membrane  in  the  vicinity  of  the  synapse.  Also,  there 
exist voltage-gated Ca$^{2+}$ channels in dendrites that could provide nonlinear
coupling between inputs (Llinás 1988a). All of these mechanisms, and
possibly  others,  offer  a  multiplicative-type  effect  that  is  suitable  for  gating 
information  flow  through  the  cortex  (see  also  Koch  and  Poggio  1992). 

From  a  computational  viewpoint,  gating  inputs  within  the  dendrites 
provides  a  much  higher  degree  of  flexibility  than  would  merely  gating  the 
outputs of pyramidal cells. Since the output of a pyramidal cell may branch
to several cortical areas and make synaptic connections to a multitude of
neurons, any modulation of the cell's output will simply be duplicated
at all these input points. Gating inputs within the dendrites, on the other
hand, allows many intermediate results, $c_k \Gamma_{ijk} I_j^{in}$, to be computed within
the postsynaptic membrane and then summed together within a single
cell.  This  results  in  a  computational  structure  far  richer  and  more  compact 
(Mel  1992),  and  provides  a  higher  degree  of  flexibility  in  remapping  visual 
information.  We  believe  the  demonstrable  computational  advantage  of 
dendritic  gating  mechanisms  for  visual  processing  motivates  the  need  to 
specifically  look  for  such  mechanisms  experimentally. 

Predictions 


Neurophysiology  The  most  obvious  prediction  of  the  dynamic  routing 
circuit  model  is  that  the  receptive  fields  of  cortical  neurons  should  change 
their  position  or  size  as  attention  is  shifted  or  rescaled.  This  effect  should 
be  especially  pronounced  in  higher  cortical  areas.  Moran  and  Desimone 
(1985)  found  that  receptive  fields  of  neurons  in  areas  V4  and  IT  of  primate 
visual cortex seem to be dynamically modulated so that unattended stimuli
have a reduced effect on the cell's response, even though they lie within
the  classical  receptive  field.  This  result  is  consistent  with  the  prediction  of 
the  model,  as  explained  graphically  in  figure  13.9.  The  stronger  prediction 
of  the  model — that  receptive  fields  should  shift  and  rescale  proportional  to 
the  position  and  size  of  the  attentional  window — needs  to  be  tested  by  ex¬ 
plicitly  mapping  out  receptive  fields  under  different  attentional  conditions. 
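
The gating account of the Moran and Desimone (1985) result can be illustrated with a toy calculation. This is a deliberately crude sketch: the single cell, the two stimulus locations, the feature drives of 1.0 and 0.1, the all-or-none gates, and the name `cell_response` are assumptions introduced here and are not part of the model's specification.

```python
import numpy as np

def cell_response(feature_drive, gates):
    """Sum of the cell's inputs, each multiplied by the gate on its pathway."""
    return float(np.sum(np.asarray(gates) * np.asarray(feature_drive)))

# Two stimuli inside the classical receptive field of one cell:
# index 0 is the effective stimulus, index 1 the ineffective one.
feature_drive = np.array([1.0, 0.1])

# Gate settings for the four attentional conditions (1 = pathway open, 0 = gated out).
conditions = {
    "nonattentive":       [1.0, 1.0],
    "attend_effective":   [1.0, 0.0],
    "attend_ineffective": [0.0, 1.0],
    "attend_outside_rf":  [1.0, 1.0],
}
responses = {name: cell_response(feature_drive, g) for name, g in conditions.items()}
```

In this toy the response stays close to its nonattentive level when the effective stimulus is attended, drops sharply when the ineffective stimulus is attended, and is unchanged when attention lies outside the receptive field.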

Since  we  hypothesize  that  pulvinar  neurons  control  the  remapping  pro¬ 
cess,  we  would  predict  that  lesions  to  the  pulvinar  should  dramatically 
degrade  attentional  and  pattern  recognition  abilities.  Neurophysiological 
data  thus  far  indicate  that  pulvinar  lesions  do  indeed  degrade  attentional 
capabilities,  but  tests  of  pattern  recognition  capabilities  (e.g.,  Chalupa  et  al. 
1976)  have  used  such  simple  stimuli  that  it  is  difficult  to  discern  to  what  ex¬ 
tent  detailed  spatial  recognition  is  affected.  One  would  also  expect  to  find 
some  form  of  enhancement  in  the  response  of  pulvinar  neurons  projecting 
to  those  areas  of  the  cortex  within  the  topographic  vicinity  of  the  attentional 
beam.  Petersen  et  al.  (1985)  reported  such  an  enhancement  effect  for  neu¬ 
rons  in  the  dorsomedial  portion  of  the  pulvinar  (which  is  connected  with 
PP), but not in the inferior or lateral portion (which is connected to V1-IT).
The  lack  of  enhancement  here  may  be  due  to  the  fact  that  the  task  used  in 
this  experiment  was  very  simple  (detecting  the  dimming  of  a  spot  of  light). 


Neuroanatomy  Our  particular  routing  circuit  model  predicts  that  the  size 
of  the  cortical  region  from  which  a  cell  receives  its  input  should  increase 
by  roughly  a  factor  of  two  at  each  stage  in  the  hierarchy  of  visual  areas  in 
the  form  pathway  (see  figure  13.8).  While  there  exists  some  evidence  in 
^support  of  this  prediction — for  example,  projections  from  V4  to  IT  are  more 
diffuse  than  projections  from  VI  to  V2  (Van  Essen  et  al.  1986;  DeYoe  and 
Sisola  1991;  see  also  Rockland  1992b)— more  accurate  and  higher  resolution 
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Figure  13.9  The  dynamic  routing  circuit  interpretation  of  the  Moran  and  Desimone  (1985) 
experiment.  The  node  in  layer  V4  indicates  the  cell  under  scrutiny.  The  hatched  region  indi¬ 
cates  those  connections  to  the  cell  that  are  enabled;  the  others  are  disabled.  The  bounds  of  the 
window  of  attention  in  each  area  are  shown  by  the  stippled  lines,  (a)  In  the  nonattentive  state, 
all  connections  will  be  open  and  the  effective  stimulus  can  excite  the  cell  anywhere  within 
its  classical  receptive  field.  ( b )  When  attending  to  the  effective  stimulus,  the  cell's  response 
should  be  unaltered  since  the  neural  pathways  to  the  stimulus  are  still  open,  (c)  When  at¬ 
tending  to  the  ineffective  stimulus,  the  cell's  response  should  decrease  substantially  since  the 
neural  pathways  to  the  effective  stimulus  are  gated  out.  ( A )  When  attending  outside  the  cell's 
classical  receptive  field,  there  is  no  need  to  gate  the  cell's  inputs  since  it  is  no  longer  taking 
part  in  the  process  of  routing  information  within  the  window  of  attention. 


data  are  needed  to  confirm  or  contradict  the  specific  architecture  of  the 
proposed  routing  circuit.  The  model  also  predicts  that  pulvinar  afferents 
should  terminate  in  the  cortex  in  such  a  way  that  they  could  effectively 
modulate  intercortical  synaptic  strengths.  Neuroanatomical  studies  thus 
far  seem  to  be  in  agreement  with  this  prediction  (e.g.,  Trojanowski  and 
Jacobson  1976),  but  it  would  be  of  interest  to  know  if  the  pulvinar  afferents 
make  contact  with  inhibitory  interneurons  or  directly  onto  the  dendrites  of 
pyramidal  cells.  If  the  latter  is  true,  it  would  be  of  interest  to  know  whether 
these  synapses  are  made  near  corticocortical  synapses.  Finally,  the  model 
predicts  that  there  should  exist  lateral  interconnections  among  pulvinar 
neurons  to  constrain  their  activity  to  be  consistent  with  a  single  position  and 
size of the window of attention. This is partially supported by the existence
of interneurons within the pulvinar (Ogren and Hendrickson 1979), but it
remains  to  be  seen  if  the  axons  of  projection  neurons  have  collaterals  that 
spread  horizontally  within  the  pulvinar,  or  to  what  extent  the  reticular 
nucleus  of  the  thalamus  may  subserve  intrapulvinar  communication. 

Psychophysics  The  number  of  sample  nodes  in  the  top  layer  of  the  rout¬ 
ing  circuit  implies  that  the  spatial  resolution  of  the  window  of  attention 
is  limited  to  a  diameter  of  about  30  pixels.  Although  this  estimate  is  con¬ 
sistent  with  several  lines  of  psychophysical  evidence,  including  studies  of 
spatial  acuity,  contrast  sensitivity  to  gratings,  and  recognition  (Campbell 
1985; Van Essen et al. 1991), none of these studies was actually directed at
studying  visual  attention.  Most  of  the  experiments  had  long  display  times 
that  could  conceivably  have  allowed  several  attentional  fixations  (although 
we  doubt  that  this  would  have  been  a  major  contaminating  factor  in  most 
cases).  One  possible  approach  to  testing  this  prediction  more  thoroughly 
would  be  to  test  pattern  discrimination  ability  as  a  function  of  the  position, 
size,  and  resolution  of  the  object.  Assuming  a  subject  could  be  properly  pre¬ 
cued  to  attend  to  a  certain  position  and  size  of  the  visual  field,  and  that  dis¬ 
play  times  were  limited  to  the  order  of  50  msec,  we  would  predict  that  per¬ 
formance  would  drop  off  sharply  once  the  task-specific  spatial  frequency 
content  of  the  stimulus  exceeded  approximately  15  cycles  across  the  object. 
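
The step from a window about 30 samples across to a limit of roughly 15 cycles follows from the sampling bound that each cycle needs at least two samples. A minimal sketch of that arithmetic is given here. The function name is hypothetical, and the sketch assumes the bound applies directly to the number of sample nodes spanning the window.

```python
def max_resolvable_cycles(samples_across_window: int) -> float:
    """Highest number of grating cycles representable across the window.

    Sampling bound: each cycle needs at least two samples.
    """
    if samples_across_window <= 0:
        raise ValueError("need a positive number of samples")
    return samples_across_window / 2


# About 30 sample nodes across the window of attention (figure from the text).
print(max_resolvable_cycles(30))  # 15.0
```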

The  model  also  suggests  that  once  a  location  has  been  attended  to  in 
the  visual  field,  it  should  be  difficult  to  stay  there  or  immediately  revisit  it 
since the control neurons corresponding to that part of the visual field
are  currently  inhibited  from  firing.  This  is  consistent  with  the  psychophys¬ 
ical  observation  that  involuntary  attentional  fixations  tend  to  be  transient 
(Nakayama  and  Mackeben  1989)  and  appear  to  be  inhibited  from  return 
(Posner  and  Cohen  1984). 

Recognition  of  Highly  Complex  Patterns  The  current  version  of  our 
dynamic  routing  model  is  capable  of  subserving  translation  and  scale- 
invariant  pattern  discrimination  when  tested  with  small  numbers  of  rela¬ 
tively  simple  stimuli  (see  figure  13.7).  These  rudimentary  capabilities  are  a 
far  cry  from  those  of  our  own  visual  system,  which  can  effortlessly  discrimi¬ 
nate,  for  example,  among  hundreds  or  thousands  of  human  faces,  indepen¬ 
dent  of  size,  position,  and  viewing  angle.  Besides  quantitatively  scaling  to 
a  higher  density  representation,  two  types  of  qualitative  refinement,  both 
alluded  to  already,  will  be  needed  for  our  model  to  more  closely  approach 
human  performance.  First,  there  needs  to  be  a  much  more  sophisticated 
control  structure  that  provides  for  warping  of  image  representations,  in 
addition  to  the  translational  and  scaling  capacities  already  achieved  in  our 
current  model.  Second,  substantial  processing  of  form  and  textural  cues 
should  be  carried  out  at  early  and  intermediate  stages,  rather  than  hav¬ 
ing  only  pixel-based  information  transmitted  to  the  high-level  associative 
memory.  For  example,  it  might  be  sensible  to  have  cells  at  the  high  levels 

Figure 13.10 The importance of preserving spatial relationships. The two objects (a) and (b)
contain the same spatial features but result in very different percepts.


of  the  routing  circuit,  just  before  the  associative  memory  stage  (cf.  figure 
13.2),  that  have  attained  selectivity  for  complex  local  features,  such  as  the 
shapes  of  particular  parts  of  the  face  (e.g.,  the  eyes,  nose,  or  mouth).  In¬ 
deed,  cells  with  characteristics  of  this  type  have  been  reported  in  posterior 
inferotemporal  cortex  of  anesthetized  monkeys  (Fujita  et  al.  1992).  A  cru¬ 
cial  (but  untested)  aspect  of  our  model  is  that  such  cells  should  also  be 
tuned  for  the  location  of  features  within  the  window  of  attention,  thereby 
preserving  an  explicit  representation  of  local  spatial  relationships. 

To  illustrate  how  this  strategy  might  work,  consider  the  example  of  two 
readily  distinguishable  cartoon  faces  that  are  made  up  of  identical  ele¬ 
ments,  one  in  a  natural  configuration  (figure  13.10a)  and  the  other  in  a 
spatially  scrambled  configuration  where  the  positions  of  the  eyes,  nose, 
and  mouth  are  interchanged  (figure  13.10b).  When  one  attends  to  the  face 
as  a  whole,  we  propose  that  each  pattern  activates  very  different  popu¬ 
lations  of  the  aforementioned  feature-selective  neurons  in  posterior  infer¬ 
otemporal  cortex,  even  though  they  consist  of  identical  components.  For 
example,  the  normal  face  would  activate  cells  that  are  selective  for  eyes 
situated  in  the  upper  part  of  the  window  of  attention,  whereas  the  scram¬ 
bled  face  would  activate  a  different  set  of  cells  selective  for  eyes  situated 
in  the  middle  and  lower  part  of  the  window.  (The  former  cells  might  be 
more  numerous  than  the  latter  as  a  result  of  biased  exposure  in  normal 
visual  experience.)  In  each  case,  the  inputs  would  feed  into  an  associative 
memory,  where  they  would  elicit  different  patterns  of  activity  and  hence 
different  visual  percepts.  We  contend  that  the  preservation  of  information 
about  local  spatial  relationships  up  to  (but  not  including)  the  associative 
memory  stage  provides  an  efficient  basis  for  discriminating  among  a  wide 
variety  of  complex  natural  objects. 
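
The argument can be made concrete with a toy encoding, offered only as an illustrative sketch. It assumes each hypothetical cell is labeled by a feature type and a coarse position within the window of attention ("upper", "middle", "lower"); the labels and the two layouts are assumptions chosen to mirror figure 13.10, not a description of recorded cells.

```python
from typing import Dict, FrozenSet, Tuple

# A hypothetical cell is identified by (feature type, coarse position in window).
Cell = Tuple[str, str]


def active_cells(layout: Dict[str, str]) -> FrozenSet[Cell]:
    """Return the set of position-tuned feature cells driven by a layout.

    `layout` maps a coarse window position to the feature found there.
    """
    return frozenset((feature, position) for position, feature in layout.items())


normal_face = {"upper": "eyes", "middle": "nose", "lower": "mouth"}
scrambled_face = {"upper": "mouth", "middle": "eyes", "lower": "nose"}

normal = active_cells(normal_face)
scrambled = active_cells(scrambled_face)

# The same feature types are present in both images...
same_features = {f for f, _ in normal} == {f for f, _ in scrambled}
# ...but the two images drive disjoint populations of position-tuned cells.
shared = normal & scrambled
print(same_features, len(shared))
```

In this toy case the two faces contain identical features yet share no active cell, so an associative memory receiving these inputs would see two entirely different patterns.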

One  possible  concern  about  this  proposed  strategy  is  the  requirement 
for  a  large  number  of  cells,  because  these  would  need  to  be  tuned  for  many 
different  positions  of  each  featural  type  relative  to  the  boundaries  of  the 
window  of  attention.  This  difficulty  could  be  minimized  by  having  rea¬ 
sonably  broad  tuning  for  position  as  well  as  featural  characteristics.  In 
any  event,  similar  issues  must  be  addressed  by  any  model  that  is  capable 
of  complex  pattern  recognition.  For  example,  one  alternative  would  be  to 
have cells that are explicitly tuned for the distance between particular features
as well as the orientation of the axis between them (F. Crick, personal
communication). To discriminate between figure 13.10a and 13.10b using
this strategy, it would be important to have cells that are selective both for
particular feature conjunctions and for certain positional relationships (e.g.,
a  nose-like  feature  that  is  a  particular  distance  directly  above  a  mouth-like 
feature).  We  contend  that  this  strategy  would  lead  to  an  even  more  severe 
combinatorial  explosion  in  the  numbers  of  cells  needed  to  encode  all  of  the 
requisite  combinations  of  featural  and  positional  cues. 
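
A rough count illustrates why the second scheme might scale worse. The sketch below is purely illustrative: the numbers of feature types, window positions, distance bins, and orientation bins are arbitrary assumptions, and the counting rules (one cell per feature-position pair versus one cell per feature pair, distance bin, and orientation bin) are simplifications introduced here rather than taken from either proposal.

```python
def position_tuned_count(n_features: int, n_positions: int) -> int:
    """One cell per (feature type, position within the window)."""
    return n_features * n_positions


def pairwise_relational_count(
    n_features: int, n_distance_bins: int, n_orientation_bins: int
) -> int:
    """One cell per (unordered feature pair, distance bin, axis orientation bin)."""
    n_pairs = n_features * (n_features + 1) // 2  # includes same-type pairs
    return n_pairs * n_distance_bins * n_orientation_bins


# Arbitrary illustrative values, not estimates from data.
n_features = 100
print(position_tuned_count(n_features, n_positions=25))
print(pairwise_relational_count(n_features, n_distance_bins=5, n_orientation_bins=8))
```

Under these assumed values the position-tuned scheme needs 2,500 cells and the pairwise scheme 202,000. The point is only that the first count grows linearly with the number of feature types and the second quadratically.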

Comparison  with  Other  Network  Models 

Control  vs.  Synchronicity  Recently,  there  has  been  widespread  interest 
in  the  possibility  that  synchrony  of  neural  firing  could  serve  as  a  code  for 
linking  features  common  to  a  given  object.  Synchronous  activation  could 
operate  by  transient  increases  in  connection  strengths  (Crick  1984;  von  der 
Malsburg  and  Bienenstock  1986).  In  this  way,  temporal  information  might 
be  used  to  solve  the  "binding  problem"  (see  chapter  10  by  Singer)  and 
thereby  mediate  aspects  of  figure/ground  segregation,  attention,  and  per¬ 
haps  even  consciousness  (see  chapter  5  by  Koch  and  Crick).  A  potential 
weakness  common  to  these  approaches  is  that  information  about  what  is 
being  connected  to  what  at  any  instant  in  time  is  not  explicitly  encoded 
anywhere  in  the  system.  In  our  model,  this  information  is  encoded  ex¬ 
plicitly  in  the  activities  of  the  control  neurons,  which  then  allows  it  to  be 
utilized  advantageously  in  a  number  of  ways. 

One  way  that  information  about  connectivity  can  be  utilized  is  in  con¬ 
straining  the  active  connections  between  retinal  and  object-based  reference 
frames  to  be  in  accordance  with  a  global  shift  and  scale  transformation. 
This  constraint  is  incorporated  in  our  model  via  the  competitive  and  coop¬ 
erative  interactions  among  the  control  neurons.  During  object  recognition, 
this  constraint  drastically  reduces  the  number  of  degrees  of  freedom  in 
matching  points  between  the  retinal  and  object-centered  reference  frames, 
because  once  a  few  point-to-point  correspondences  have  been  established, 
the  number  of  potential  matches  between  other  pairs  of  points  is  greatly 
reduced.  Researchers  in  machine  vision  have  termed  this  the  viewpoint 
consistency constraint, and it has proved to be a powerful computational
strategy  for  object  recognition  systems  (Hinton  1981b;  Lowe  1987). 
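
The reduction in degrees of freedom can be sketched for the simplest case of a uniform scale plus a shift with no rotation, so that a retinal point r and an object-frame point o are related by r = s·o + t. Two correspondences then fix s and t, and every other match is predicted rather than searched for. The function names and the example coordinates are hypothetical.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def fit_shift_and_scale(o1: Point, r1: Point, o2: Point, r2: Point) -> Tuple[float, Point]:
    """Recover scale s and shift t of r = s * o + t from two correspondences."""
    d_obj = math.dist(o1, o2)
    if d_obj == 0:
        raise ValueError("the two object-frame points must differ")
    s = math.dist(r1, r2) / d_obj
    t = (r1[0] - s * o1[0], r1[1] - s * o1[1])
    return s, t


def predict_retinal(o: Point, s: float, t: Point) -> Point:
    """Where an object-frame point must appear on the retina, given s and t."""
    return (s * o[0] + t[0], s * o[1] + t[1])


# Two established correspondences (object frame -> retinal frame).
s, t = fit_shift_and_scale((0.0, 0.0), (10.0, 20.0), (1.0, 0.0), (13.0, 20.0))
# Any further object point now has a single admissible retinal location.
print(s, t, predict_retinal((0.0, 1.0), s, t))
```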

Another  advantage  of  having  information  about  active  connection  states 
readily  available  is  that  the  ensemble  of  control  neurons  together  forms  a 
neural  code  for  the  current  position  and  size  of  the  window  of  attention. 
Therefore,  the  position  and  size  of  an  object  can  be  inferred  by  simply 
reading out the state of the control neurons. In addition, it would also be
possible  for  the  control  neurons  to  warp  the  reference  frame  transformation 
to  form  object  representations  that  are  invariant  to  distortion  (e.g.,  hand¬ 
written  digits),  in  which  case  information  about  the  particular  shape  of 
the  object  (e.g.,  its  slant  or  style)  could  also  be  preserved.  Note  that  such 
information is typically lost in networks that utilize feature hierarchies
of complex cells (Fukushima 1980, 1987; LeCun et al. 1990) or Fourier
transforms (e.g., Pollen et al. 1971; Cavanagh 1978, 1985) for forming
position-,  scale-,  and/or  distortion-invariant  representations. 

One  final  advantage  of  having  control  explicitly  represented  is  that  it 
allows  attention  to  be  easily  directed  "at  will,"  or  by  other  modalities,  since 
those  areas  of  the  brain  that  have  access  to  the  control  neurons  (such  as 
parietal  cortex)  can  directly  influence  where  attention  is  directed.  This  also 
provides  a  convenient  format  for  mediating  the  access  to  control  among 
various  competing  demands. 

An  interesting  issue  is  whether  the  control  strategy  advocated  in  our 
model  could  be  implemented  using  oscillations  or  temporal  synchrony 
rather  than  the  multiplicative  synaptic  interactions  suggested  in  an  ear¬ 
lier  section.  While  we  cannot  rule  this  out,  we  find  it  difficult  to  envision 
precisely  how  such  a  model  could  route  information  flow  within  the  cor¬ 
tex  while  simultaneously  preserving  local  spatial  relationships  within  the 
attentional  window.  Alternatively,  oscillations  might  provide,  by  anal¬ 
ogy  to  digital  computers,  a  clocking  signal  to  control  the  precise  timing  of 
switches  in  information  flow. 

Control-Based  Network  Models  A  number  of  other  network  models  of 
attention  have  also  utilized  the  concept  of  control  neurons  for  directing 
information  flow.  Niebur  et  al.  (1993),  Ahmad  (1992),  Tsotsos  (1991), 
and  Mozer  and  Behrmann  (1990),  among  others,  have  proposed  various 
schemes  for  selecting  and  routing  information  from  a  select  portion  of  the 
visual  scene.  However,  none  of  these  models  explicitly  preserves  spa¬ 
tial  relationships  within  the  window  of  attention,  which  we  consider  to 
be  a  critical  component  of  the  routing  process.  Hinton  (1981a),  Hinton 
and  Lang  (1985),  and  Sandon  (1990)  have  proposed  control-based  models 
that  do  preserve  spatial  relationships  within  the  window  of  attention  and 
share  the  same  basic  flavor  as  the  model  presented  here  (i.e.,  remapping 
object  representations  from  retinal  into  object-centered  reference  frames). 
Although  these  latter  models  attempt  to  model  psychophysical  data,  we 
feel  that  they  lack  the  necessary  level  of  neurobiological  detail  to  give  them 
strongly  predictive  value  in  biology. 

Recently,  Postma  et  al.  (1992)  proposed  a  neural  model  based  on  the  orig¬ 
inal  shifter  circuit  proposal  (Anderson  and  Van  Essen  1987)  to  account  for 
translational  invariance  in  visual  object  priming  (Biederman  and  Cooper 
1992). This model shares many similarities with the model presented here,
including  top-down,  or  template-driven  control,  although  it  differs  in  the 
specifics  of  the  control  structure. 

Ullman  (see  chapter  12  by  Ullman)  has  proposed  that  pattern  recogni¬ 
tion  is  achieved  by  a  "counterstreams"  strategy,  in  which  information  about 
stored  patterns  flows  top-down  at  the  same  time  that  information  about 
currently  viewed  patterns  flows  in  the  bottom-up  direction.  Multiple  coex¬ 
isting  representations  flow  in  each  direction,  and  recognition  is  manifested 
by a winner-take-all competition to find the best match between patterns
propagating  in  the  two  directions.  His  model  shares  with  ours  the  notion 
that  explicit  control  processes  are  needed  to  regulate  information  flow. 
However,  these  processes  are  different  in  many  respects,  particularly  with 
regard  to  the  multiplicity  of  actively  propagated  stimulus  representations. 

Lastly,  there  are  important  lines  of  convergence  and  divergence  in  com¬ 
paring  our  model  to  the  ideas  of  Mumford  (see  chapter  7  by  Mumford) 
that  are  based  on  Grenander's  Pattern  Theory  (Grenander  1976-81).  He 
invokes  the  need  for  "domain  warping"  that  can  be  mediated  by  neural 
shifter  circuits  analogous  to  those  we  have  proposed. 

A  MODEL  OF  DIRECTED  MOTOR  CONTROL 

Computational  Framework 

In  the  introduction,  we  argued  that  directed  motor  control  and  directed 
visual  attention  share  common  computational  underpinnings,  in  terms  of 
requirements  for  a  high  degree  of  anatomical  convergence  and  divergence 
and  for  mechanisms  for  dynamically  controlling  information  flow.  These 
commonalities  are  illustrated  schematically  in  figure  13.11.  As  noted  al¬ 
ready,  visual  attention  involves  the  transient  selection  of  a  tiny  subset  of 
the  information  being  transmitted  along  the  optic  nerve.  In  the  preceding 
section,  our  model  included  only  a  single  high-level  center  mediating  all 
aspects  of  pattern  recognition.  In  reality,  though,  there  may  well  be  distinct 
neural  populations  responsible  for  qualitatively  different  types  of  pattern 
recognition.  For  example,  the  population  of  cells  responsible  for  recogniz¬ 
ing  faces  may  be  largely  or  entirely  different  from  those  responsible  for  rec¬ 
ognizing  alphanumeric  characters  or  from  those  responsible  for  recogniz¬ 
ing  different  flowers  and  trees.  If  so,  there  needs  to  be  an  output  selection 
process  for  determining  the  target  population  to  which  attended  informa¬ 
tion  is  sent  in  addition  to  the  aforementioned  input  selection  process  for  di¬ 
recting  the  position  and  scale  of  the  window  of  attention  (figure  13.11a;  see 
also  figure  5  in  Anderson  and  Van  Essen  1987).  For  simplicity,  we  assume 
here  that  the  different  target  populations  can  be  represented  by  separate  en¬ 
tries  at  the  top,  even  if  they  happen  to  overlap  physically  within  the  cortex. 

For  directed  motor  control,  there  needs  to  be  a  cognitively  driven  input 
selection  process  to  generate  the  appropriate  motor  routine  and  an  output 
selection  process  that  determines  to  which  appendage  or  combination  of 
appendages  this  pattern  is  directed.  Consider,  once  again,  the  example  of 
the  orchestral  conductor  discussed  in  the  introduction.  The  basic  rhythm 
(say,  for  a  brisk  march  having  two  beats  per  measure)  is  presumably  repre¬ 
sented  in  some  central  pattern  generator  (an  internal  metronome,  in  effect) 
that  can  be  activated  and  modulated  by  auditory,  visual,  and  other  cues 
(figure  13.11b).  For  any  given  rhythm,  the  conductor  must  transform  this 
beat  into  a  particular  motor  routine  (e.g.,  up-down  or  left-right  strokes,  or 
circular  strokes)  and  then  direct  this  routine  to  the  appropriate  digit(s)  or 

Figure 13.11 Control of information flow for visual attention (a) and directed motor control
(b).  The  direction  of  information  flow  is  bottom-up  for  sensory  processing  and  top-down 
for  motor  function,  but  in  both  cases  we  propose  the  need  for  separate  control  mechanisms 
to  regulate  the  flow  dynamically.  The  attentional  selection  process  tends  to  be  relatively 
transient,  whereas  directed  motor  routines  can  be  very  sustained. 

other  appendage(s)  (cf.  Keele  1981).  Note  that  information  flow  is  directed 
downward  in  this  scheme,  because  it  involves  top-down  communication 
from  high-level  centers  to  ones  that  are  closer  to  the  periphery. 

A  Simplified  Vector-Code  Scheme 

The  motor  system  is  extremely  complex  and  is  arguably  more  diverse  than 
the  visual  system  in  the  layout  of  its  major  components,  which  include  cere¬ 
bellar  as  well  as  cerebral  cortical  areas  and  numerous  subcortical  nuclei  in 
the  forebrain  (basal  ganglia),  thalamus,  midbrain,  and  brainstem.  It  has 
been  intensively  studied  using  anatomical,  physiological,  and  computa¬ 
tional  approaches  in  attempting  to  decipher  how  information  flows  and  is 
processed  in  this  complex  network  (cf.  Houk  et  al.  1993;  Hoover  and  Strick 
1993).  For  our  purpose  in  this  preliminary  analysis,  it  is  sufficient  to  focus 
on  a  highly  simplified  scheme  in  which  the  representation  at  both  middle 
and  lower  levels  in  figure  13.11b  involves  a  vector-code  strategy  similar  to 
that proposed by Georgopoulos and colleagues for primary motor cortex
(Georgopoulos et al. 1988, 1989). In this scheme, each neuron "votes"
for  a  particular  direction  of  movement  for  the  portion  of  the  limb  that  it 
influences,  and  the  strength  of  its  vote  is  proportional  to  its  firing  rate. 
We  further  assume  that  there  is  an  orderly  representation  of  movement 
direction  across  the  cortical  surface.  In  the  example  illustrated,  neurons 
on  the  left  represent  upward  movement,  and  neurons  further  to  the  right 
represent  progressively  more  clockwise  directions  of  movement.  Thus,  the 
up-down  movements  schematized  in  figure  13.1  would  be  represented  as 
a motor routine in which activity would alternate between neurons representing
upward and downward movement. To have a particular digit or
other  appendage  execute  this  movement,  this  stereotyped  spatiotemporal 
pattern  would  be  selectively  routed  to  the  neural  populations  in  motor  cor¬ 
tex  that  control  the  relevant  digits.  By  having  direction  of  movement  be 
encoded  in  the  same  way  at  both  levels  in  this  scheme,  the  routing  process 
is  simplified  so  that  neurons  at  the  lower  level  can  project  in  a  1:1  mapping 
to  neurons  within  the  representation  for  each  appendage.  Note  that  for 
conceptual  clarity,  we  have  treated  the  representation  of  each  digit  or  other 
appendage  as  a  separate  neural  ensemble.  In  reality,  it  is  likely  that  these  are 
often  overlapping,  interleaved  ensembles  of  neurons  (Schieber  1990, 1992). 
However,  we  will  ignore  these  complexities  in  our  zeroth-order  model,  for 
the  same  reason  that  our  attention  model  involves  major  simplifications 
that  we  regard  as  not  crucial  to  the  initial  formulation  even  though  they  in 
due  course  will  necessitate  expansion  and  refinement  of  the  model. 
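
The vector-code scheme and the 1:1 routing step can be sketched as follows. This is a minimal illustration under stated assumptions: directions are measured in degrees clockwise from "up", the cosine-shaped firing profile in the example is an assumption made for the demonstration, and the function names and digit labels are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def decode_direction(preferred_deg: Sequence[float], rates: Sequence[float]) -> float:
    """Population-vector estimate of movement direction.

    Angles are in degrees clockwise from "up" (0 = up, 90 = right, 180 = down).
    Each neuron votes for its preferred direction, weighted by its firing rate.
    """
    x = sum(r * math.sin(math.radians(d)) for d, r in zip(preferred_deg, rates))
    y = sum(r * math.cos(math.radians(d)) for d, r in zip(preferred_deg, rates))
    if x == 0 and y == 0:
        raise ValueError("no net vote: direction is undefined")
    return math.degrees(math.atan2(x, y)) % 360.0


def route_to_digit(
    pattern: Sequence[float], target: str, digits: Sequence[str]
) -> Dict[str, List[float]]:
    """Copy the pattern 1:1 onto the target digit's population; silence the rest."""
    return {d: [r if d == target else 0.0 for r in pattern] for d in digits}


# Eight neurons ordered from "up" through progressively more clockwise directions.
preferred = [0.0, 45.0, 90.0, 135.0, 180.0, 225.0, 270.0, 315.0]
# Assumed firing profile for a downward stroke (rectified cosine around 180 degrees).
rates_down = [max(0.0, math.cos(math.radians(d - 180.0))) for d in preferred]

direction = decode_direction(preferred, rates_down)
routed = route_to_digit(rates_down, "digit 2", ["digit 1", "digit 2"])
print(round(direction, 6))
```

Because direction is encoded identically at both levels, the routing step is a plain copy gated by the choice of target, with no recoding needed.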

In  considering  how  this  scheme  might  be  implemented  in  the  primate 
motor system, it is important to identify the cortical and/or subcortical
structures  in  which  different  motor  routines  are  represented  and  the  major 
anatomical  pathways  along  which  this  information  is  transmitted.  Unfor¬ 
tunately,  these  basic  issues  are  not  well  understood.  One  interesting  ob¬ 
servation  is  that  primary  motor  cortex  is  at  a  lower  level  than  other  motor 
areas  in  an  anatomically  based  hierarchy  that  derives  from  the  laminar  pat¬ 
terns  in  which  different  pathways  originate  and  terminate  (Felleman  and 
Van  Essen  1991).  A  priori,  there  is  no  compelling  reason  why  this  need  be 
the  case,  as  opposed  to  a  situation  in  which  primary  motor  cortex  was  at 
the  pinnacle  of  a  sensory  input  to  motor  output  hierarchy.  The  observed 
relationship  fits  nicely  with  our  presumption  that  directed  motor  control 
involves  top-down  flow  of  information,  as  already  indicated  in  figure  13.11. 
However,  it  raises  the  puzzling  possibility  that  the  spatiotemporal  patterns 
associated  with  different  motor  routines  might  be  communicated  between 
cortical  areas  via  anatomically  descending  (feedback)  pathways,  whose 
terminations  preferentially  avoid  the  middle  layers  of  cortex.  This  would 
contrast  with  the  basic  pattern  of  information  flow  in  visual,  auditory,  and 
somatosensory  systems,  which  is  characterized  by  ascending  inputs  that 
preferentially  terminate  in  the  middle  layers  of  cortex  (see  Felleman  and 
Van  Essen  1991).  On  the  other  hand,  it  would  correspond  to  the  pattern 
of  information  flow  in  olfactory  cortex  and  related  areas,  where  ascending 
pathways  tend  to  terminate  in  layer  1  and  other  superficial  layers  (Haberly 
and  Price  1978;  Carmichael  1993). 

An  intriguing  alternative  is  that  information  about  a  given  motor  rou¬ 
tine  might  be  communicated  indirectly  between  cortical  areas,  via  the  well- 
known  route  involving  the  basal  ganglia  and  thalamus  (Alexander  et  al. 
1986),  rather  than  via  direct  corticocortical  connections.  Since  thalamic 
motor  nuclei  (VA  and  VL)  terminate  mainly  in  the  middle  layers  of  motor 
cortex  (Sloper  and  Powell  1979),  this  would  represent  a  better  match  to 
the  laminar  patterns  associated  with  ascending  information  flow  in  sen¬ 
sory  cortex.  In  short,  our  current  level  of  understanding  is  insufficient  to 
determine whether the information for encoding a given motor routine
(1) flows through cortical areas and is gated by thalamic projections, (2)
flows through corticosubcortical loops and is gated by descending corticocortical
pathways, or (3) flows in a very different way that does not involve
the  explicit  gating  mechanisms  proposed  in  our  model. 

COGNITIVE  PROCESSING 

In  this  chapter,  we  have  concentrated  on  sensory  and  motor  processing, 
where  the  conceptual  issues  can  be  most  sharply  formulated  and  their  pos¬ 
sible  neurobiological  substrates  most  rigorously  analyzed.  However,  we 
suspect  that  similar  processing  strategies  apply  also  to  language  process¬ 
ing  and  to  other  cognitive  functions  as  well.  This  supposition  derives  from 
thinking  about  examples  such  as  the  following. 

Suppose  that  an  observer  is  asked  to  remember  the  identity  of  an  object 
(say,  an  apple)  whose  picture  is  briefly  flashed  on  a  screen.  After  a  brief 
delay, the observer is asked to report the identity of the object by (1) stating
its  name  verbally,  (2)  writing  its  name  on  a  sheet  of  paper,  or  (3)  signing 
its  name  in  American  Sign  Language.  In  another  variation  of  this  task, 
a  multilingual  observer  could  be  asked  to  state  the  name  of  the  object  in 
different  languages  (e.g.,  English,  French,  or  Polish).  In  situations  of  this 
type,  we  presume  that  information  about  the  object's  identity  is  stored  in  a 
restricted  number  of  central  representations  and  that  this  information  can 
be  routed  to  a  variety  of  different  output  modalities  and  in  a  variety  of 
different  formats.  For  the  purposes  of  this  argument,  it  does  not  matter 
whether  the  object's  identity  is  stored  as  a  visual  memory,  in  a  particular 
language-specific  format,  or  as  some  completely  abstract  cognitive  con¬ 
struct.  Whatever  the  format  of  the  representation,  the  brain  must  be  able 
to  access  the  information,  rapidly  translate  it  into  an  appropriate  format, 
and  transmit  it  to  the  appropriate  motor  structures.  We  have  already  ar¬ 
gued  that  the  sensory  and  motor  components  of  such  tasks  are  likely  to 
entail  dynamic  routing  strategies.  It  requires  only  a  modest  conceptual 
leap  to  suppose  that  analogous  routing  strategies  may  be  used  to  control 
the  flow  of  information  in  whatever  central  structures  are  used  to  represent 
semantic  information  and  other  high-level  abstractions  that  are  the  coinage 
of  cognitive  function. 

CONCLUDING  REMARKS 

Our  approach  to  formulating  models  of  cortical  function  has  been  guided 
by  a  number  of  computational  and  systems  engineering  considerations. 
Some  of  these  can  be  illustrated  by  the  following  analogy.  We  regard  the 
brain  as  a  system  designed  to  treat  information  as  an  essential  commodity, 
much  as  an  efficient  factory  is  designed  for  optimal  handling  of  the  phys¬ 
ical  materials  that  traffic  across  its  floors.  In  both  cases,  the  raw  materials 
that  enter  the  system  generally  represent  only  a  small  fraction  of  the  final 
product. The production process involves careful selection of useful materials,
discarding of excess or unnecessary materials, and transforming and
repackaging  of  the  desired  materials  in  an  appropriate  configuration  for 
the  particular  applications  for  which  the  product  is  intended.  For  efficient 
function,  the  flow  of  material  must  be  carefully  monitored  and  controlled. 
This  requires  specialized  systems  that  are  explicitly  designed  for  this  pur¬ 
pose,  rather  than  for  construction  and  fabrication  processes  per  se. 

In  the  nervous  system,  we  believe  that  an  important  general  strategy 
for  attaining  these  objectives  is  the  systematic  use  of  multiplicative  opera¬ 
tions,  whereby  one  set  of  inputs  from  a  class  of  "control  neurons"  dynam¬ 
ically  modulates  the  connections  between  two  other  groups  of  neurons. 
In  this  chapter  our  emphasis  has  been  on  the  modulation  of  feedforward 
pathways,  although  in  the  future  the  coordinated  modulation  of  the  corre¬ 
sponding  feedback  pathways  will  be  included  in  our  discussions.  Conven¬ 
tional  neural  network  models  typically  rely  on  computations  that  are  dom¬ 
inated  by  linear  combinations  of  synaptic  feedforward  inputs  followed  by 
a  nonlinear  operation.  This  simple  neural  network  structure  has  proven  to 
be  too  rigid  and  unwieldy  when  applied  to  large  problems.  More  complex 
nonlinearities  have  been  introduced  to  achieve  flexibility  in  neural  compu¬ 
tational  systems  using,  for  example,  dynamic  links  (von  der  Malsburg  and 
Bienenstock  1986)  or  oscillations  (Koch  and  Crick,  chapter  5;  Singer,  chap¬ 
ter  10).  While  these  models  have  encompassed  feedback  projections,  they 
have  kept  the  basic  input-output  structure  and  have  effectively  introduced 
what  we  would  call  control  functions  in  an  implicit  fashion,  rather  than  the 
explicit  fashion  we  favor.  We  suggest  that  models  that  do  not  distinguish 
control  functions  from  information  flow  and  processing  will  not  scale  well 
with  increased  problem  complexity. 
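
The three-way multiplicative interaction can be written as y_i = Σ_j c_ij · w_ij · x_j, where x is the input activity, w the fixed connection strengths, and c the gain set by the control neurons. The sketch below is one minimal reading of that idea. The binary control values, the uniform weights, and the one-dimensional shift example are simplifying assumptions, and the function names are hypothetical.

```python
from typing import List, Sequence


def gated_output(
    x: Sequence[float],
    weights: Sequence[Sequence[float]],
    control: Sequence[Sequence[float]],
) -> List[float]:
    """Compute y_i = sum_j control[i][j] * weights[i][j] * x[j]."""
    return [
        sum(c * w * xj for c, w, xj in zip(c_row, w_row, x))
        for c_row, w_row in zip(control, weights)
    ]


def shift_control(n_out: int, n_in: int, shift: int) -> List[List[float]]:
    """Control state that opens only the connections j = i + shift."""
    return [[1.0 if j == i + shift else 0.0 for j in range(n_in)] for i in range(n_out)]


x = [0.0, 0.0, 5.0, 7.0, 9.0]              # input layer activity
weights = [[1.0] * len(x) for _ in range(3)]  # fixed, uniform connection strengths

# The same inputs and weights yield different outputs under different control states.
print(gated_output(x, weights, shift_control(3, len(x), shift=2)))
print(gated_output(x, weights, shift_control(3, len(x), shift=0)))
```

With a shift of 2 the three output nodes receive the pattern 5, 7, 9 with its internal order intact; with a shift of 0 they receive 0, 0, 5. Only the control state differs between the two cases.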

Explicit  three-way  interactions  provide  a  contextual  framework  for  mod¬ 
ifying  the  interpretation  of  information,  which  is  an  essential  ingredient 
for cognitive processing. Hierarchical interpretative systems can be designed
in a cleaner, more efficient fashion using a substrate of modules
specialized  in  terms  not  only  of  memory  and  processing,  but  control  as 
well.  Implementing  complex  circuits  of  the  type  described  in  this  chapter 
requires  their  physical  structure  to  be  largely  laid  down  by  genetic  fac¬ 
tors.  Numerous  anatomical  and  physiological  facts  support  the  picture 
of  a  rather  well-defined  structure  for  the  neocortex  and  its  connections  to 
subcortical  bodies  that  is  replicated  across  individuals  and  not  modified  on 
a  gross  scale  by  experience.  This  would  suggest  there  are  preprogrammed 
strategies  for  learning  how  to  interact  with  and  adapt  to  the  environment. 
Thus, while the details of the information about complex objects contained
within the inferotemporal cortex differ between individuals, the strategies
for acquiring that information, and the way it is stored and retrieved, are
presumed to be very similar. Our brains did not evolve as general-purpose
computational  systems,  but  rather  to  compete  and  thrive  in  the  rich,  but 
restricted  context  of  human  society  and  the  world  we  live  in. 